INTRODUCTION


The evaluation of natural language processing (NLP) systems is essential for continued,
steady progress in the development and application of NLP technology. Its value encompasses
the lifecycle of individual projects (steering the course of individual system developments and
providing specifications for placing systems into service) and the spectrum from individual
projects to the whole of the technology. For particular system development projects, the
identification of strengths and weaknesses is necessary in order to chart progress, guide the
course of evolution, and provide feedback into research and development cycles. With regard to
applications, researchers must be able to measure the effectiveness of NLP systems and system
components in order to appropriately combine them with other interface technologies and
match them to the characteristics of specific tasks and user requirements in applications.
Beyond the value to particular projects, a common and consistent basis for measurement,
description, and comparison encourages the technical exchange and commingling of theories
and ideas that will be required for the science of NLP technology to advance.

Interest in developing standard NLP evaluation methodologies is growing as the technology
matures. Although there is keen interest in the problem of evaluation, there are no clear and
obvious answers to questions regarding the basis of evaluation (e.g., task, corpus, capabilities),
evaluation methods (black box versus glass box), metrics (e.g., recall, precision, averages,
percentages, statistical measures), or the format and content of reported results (e.g., numerical
scores, descriptions, profiles).

Corpus-based, task-based, and capability-based evaluation are three types of NLP
evaluation. For corpus-based evaluation, a fixed body (or "corpus") of inputs is given to the
system and measurements are made based on the system outputs. A number of standard
corpuses for NL understanding system evaluation are in use today ([BBN, 1988]; [Flickinger et
al., 1987]; [Hendrix, Sacerdoti, and Slocum, 1976]; [Malhotra, 1975]). In a task-based
evaluation, a specific task or tasks are to be completed using the system. The basis for
judgement is how well the task was accomplished. Capability-based (or checklist-based)
evaluation uses a list of individual capabilities to guide the evaluation. During evaluation,
each capability is assigned an indicator of whether or not (or, to what degree) the system
demonstrates that capability.

NLP system evaluations can occur at different levels of "detail". Black box evaluation
focuses on what a system does, measuring performance based on well-defined input/output
pairs without concern for how the system processes the input and generates its output. In
contrast, a glass box evaluation "looks into" the system and examines the particulars of how it
works. In a sense, the distinction between black and glass box evaluations can be a matter of
perspective: a black box evaluation of one or more components of a particular system could be
considered a glass box evaluation from the system level perspective.

Further complications arise when one considers that NLP systems evaluation can be
performed from the perspective of the NLP system developer, that of the application system
developer, or that of the end-user. Depending on one's perspective, varying levels of concern
will be focused on issues such as system development cost, portability, extensibility, and
maintainability; linguistic functionality in the areas of syntax, semantics, discourse,
pragmatics, and knowledge acquisition; reliability; help facilities; performance speed; and
robustness. The papers presented in this collection focus on evaluation of the linguistic
functionality of NLP systems and, in some instances, on the integration of functionality
evaluation during the development process.

Workshop  Description 

The Natural Language Processing Systems Evaluation Workshop provided a forum for
computational linguists to discuss current evaluation efforts and activities, research progress,
and new approaches; promoted scientific interchange on important evaluation issues; and
generated recommendations and directions for future investigations in the evaluation area.
The Workshop, sponsored by Rome Laboratory and attended by over 60 people, was held in
conjunction with the 29th Annual Meeting of the Association for Computational Linguistics
(ACL) on 18 June 1991 at the Berkeley Campus of the University of California.

The Workshop Call for Participation sought presentations focused on evaluation-relevant
issues that include: the identification of NLP capabilities requiring "measurement", evolving
or contrived evaluation criteria, and descriptions of current evaluation practices and
experiences. The papers in this volume are formal records of the presentations made at the
Workshop.

DARPA-sponsored Message Understanding Conferences form a major program toward
resolving issues of text understanding system evaluation. The original conference, in 1987,
had participants training and testing their systems on Navy RAINFORM messages. Ten
messages made up the training set and two additional messages were distributed when the
participants assembled at the Naval Ocean Systems Center (NOSC) for system extension and
evaluation. The second Message Understanding Conference, in 1989, presented 105 Navy
OPREP messages for training and 25 for testing. Recently, MUC-3 used a training set of 1300
texts forming a corpus of over 2.5 MB and a test set of 100 previously unseen texts. The paper
"Third Message Understanding Evaluation and Conference (MUC-3): Methodology and Test
Results" presents the activities and results of the most recent encounter, held at NOSC in San
Diego, California, on 21-23 May 1991. "MUC-3 Evaluation Metrics and Linguistic Phenomena
Tests" delves into the details of the measurements made during that event.

The paper "A Developing Methodology for the Evaluation of Spoken Language Systems"
discusses a program, supported by DARPA and the National Institute of Standards and
Technology (NIST), for evaluating Spoken Language Systems (SLS) on a database query task
using the Air Travel Information System (ATIS). Both the ATIS and MUC evaluation programs
collected a large data set, reserved a portion for testing purposes and used the major portion for
training, developed agreement on judging correct system outputs, distributed the test set for
administration at multiple sites, used automated scoring techniques, and reported numerical
scores (recall and precision for MUC-3; number right, wrong, and not answered as well as a
weighted error percentage for ATIS). The paper "Multi-Site Natural Language Processing
Evaluation: MUC and ATIS" further compares and contrasts the two evaluation programs.
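
The numerical scores named above can be illustrated with a minimal Python sketch. It is
not the scoring software of either program: the function names, the AnswerTally container,
and the default weight of 2.0 for a wrong answer are assumptions made for illustration, and
the sketch ignores any partial-credit or penalty rules the actual programs may have applied.

```python
from dataclasses import dataclass


@dataclass
class AnswerTally:
    """Counts of system answers in a query test (hypothetical container)."""
    right: int
    wrong: int
    not_answered: int


def recall(correct: int, possible: int) -> float:
    """Fraction of the answer-key fills that the system produced correctly."""
    return correct / possible if possible else 0.0


def precision(correct: int, produced: int) -> float:
    """Fraction of the fills the system produced that were correct."""
    return correct / produced if produced else 0.0


def weighted_error_percent(tally: AnswerTally, wrong_weight: float = 2.0) -> float:
    """Error percentage in which a wrong answer costs more than no answer.

    wrong_weight is an assumed value for illustration, not taken from the source.
    """
    total = tally.right + tally.wrong + tally.not_answered
    if total == 0:
        return 0.0
    return 100.0 * (wrong_weight * tally.wrong + tally.not_answered) / total


if __name__ == "__main__":
    # Invented counts, used only to exercise the functions.
    print(recall(correct=30, possible=60), precision(correct=30, produced=40))
    print(weighted_error_percent(AnswerTally(right=70, wrong=10, not_answered=20)))
```

Weighting a wrong answer more heavily than an unanswered query rewards systems that
decline to answer when unsure; the particular weight is a design choice of the evaluation.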

In contrast to the task-specific and domain-dependent approach to evaluation used in MUC
and ATIS, "The Benchmark Investigation/Identification Program" discusses a method of
evaluating NLP systems that is being designed for applicability across task types, across
domains, and to all different types of NLP systems. The developing evaluation methodology
will not require the NLP systems to be re-engineered or adapted to a particular domain or text
corpus, and will provide coverage of NLP capabilities across the categories of syntax,
semantics, discourse, and pragmatics. Furthermore, the method provides detailed capability
profiles of the NLP systems evaluated, with quantitative results for each item in the profile.
The authors report on the development of the evaluation procedure and the results of actual
use of the evaluation procedure with three NLP systems.

"Evaluating Syntax Performance of Parser/Grammars of English" reports on a
collaborative effort on the part of representatives of nine institutions to develop criteria,
methods, measures and procedures for evaluating the syntax performance of
parser/grammars. The project has progressed to the point where the first version of an
automated syntax evaluation procedure has been completed and is available for testing. The
procedure judges a parse only on the basis of constituent boundaries and yields two principal
measures: crossing parentheses and recall.
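
As a rough illustration of measures based only on constituent boundaries, the following
Python sketch treats a parse as a set of (start, end) word spans. The span representation,
the function names, and the exact definitions of "crossing" and "recall" used here are
assumptions; the paper itself should be consulted for the procedure actually implemented.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # half-open word interval [start, end); an assumed encoding


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without either one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def crossing_count(candidate: Set[Span], gold: Set[Span]) -> int:
    """Number of candidate constituents that cross at least one gold constituent."""
    return sum(1 for c in candidate if any(crosses(c, g) for g in gold))


def bracket_recall(candidate: Set[Span], gold: Set[Span]) -> float:
    """Fraction of gold constituents whose boundaries the candidate reproduces."""
    if not gold:
        return 1.0  # nothing to recover
    return len(candidate & gold) / len(gold)


if __name__ == "__main__":
    # Invented bracketings over a five-word sentence.
    gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
    candidate = {(0, 5), (0, 2), (1, 4)}
    print(crossing_count(candidate, gold), bracket_recall(candidate, gold))
```

Because only boundaries are compared, such measures are indifferent to the category labels
a particular grammar assigns, which is what allows parsers built on different grammatical
theories to be scored against a common reference.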

The paper "A Diagnostic Tool for German Syntax" describes an effort to construct a
catalogue of syntactic data which should eventually exemplify the major syntactic patterns of
the German language. The data set is being systematically constructed, rather than collected
from naturally occurring text, so as to assure coverage of syntactic phenomena and optimize
the data set. In the future, the data set will be used for error detection and system evaluation.
The effort has some similarity to the Benchmark Investigation/Identification Program,
mentioned above, in that the emphasis is on linguistic capabilities, categorized in a systematic
manner, using "hand-constructed" test data.

The next two papers discuss evaluation of semantic analysis and dialogue performance,
respectively. They do not report on methods that have been implemented but, rather, discuss
important issues and propose approaches to the evaluation process. "Preliminaries to the
Development of Evaluation Metrics for Natural Language Semantic and Pragmatic Analysis
Systems" discusses the categorization of semantic and pragmatic analysis capabilities that
must precede the development of test suites or testing methods. The paper "Corpus-Based
Evaluation of the Sundial System" discusses the importance that has been placed on the use of
corpus collection and analysis in the development and evaluation of the Sundial System. The
Sundial development project is a multi-national European Spoken Language project.

Although the Sundial development group has not yet implemented a method of evaluating
dialogue management performance, the author discusses the issues and proposes primary
criteria for evaluation.

The papers "Module-Level Testing: Lessons Learned" and "Maintaining and Enhancing a
NL DBMS Interface" address the issue of evaluation for the purpose of supporting the on-going
development of a particular system. The first stresses the use of module-level testing on
carefully constructed test suites that are representative of the actual corpus the system must
handle and a scoring methodology that provided sufficient information to support system
improvement and continued development. "Maintaining and Enhancing a NL DBMS
Interface" discusses a maintenance facility that includes a large test query collection organized
around their NLP system's general lexicon of words and phrases. Although the system is based
on Conceptual Analysis and does not have a syntactic analysis module like the system of the
previous paper, the maintenance facility does provide information concerning certain stages
of processing via the storage of intermediate data structures. Both papers "speak from
experience" and provide good insights into the problems of controlling and tracking the
development of large NLP systems.

Evaluation programs for natural language generation (NLG) technology lag behind those
for NLU, having neither funding nor a consensus on evaluation criteria and methodology.
Issues hindering NLG evaluation are discussed in "Evaluation for Generation" and
"Evaluating Natural Language Generation Facilities in Intelligent Systems". The impetus for,
and the problems surrounding determination of, an NLG evaluation methodology are
summarized in the first of these papers. The second NLG paper discusses the problems
associated with devising a standard set of NLG evaluation criteria and supports the use of a
task-based approach instead. The paper outlines the approach in the context of a current
study.

A transfer-based (or direct transfer) Machine Translation system translates from one
human language to another without using an intermediate representation language.
Transfer-based systems select word translations and then use rules to adjust the result of those
word replacements to form a sentence in the target language. MT systems with "good
compositionality" are those requiring fewer adjustment rules. The paper "Measuring
Compositionality in Transfer-Based Machine Translation Systems" describes a methodology
applied to objectively evaluate the compositionality of a transfer-based MT system.

Next, under the intriguing title of "Good Applications for Crummy Machine Translation,"
prevailing MT metrics are reviewed and a convincing argument is presented endorsing the use
of task-dependent evaluation criteria. To demonstrate this approach, the appropriateness of
various metrics for Machine (Aided) Translation for a range of applications is noted.

Our final paper in the Proceedings contrasts with the others in that it reports on a
community resource and addresses the evaluation of NLP systems based on criteria that differ
from those of the other papers. "Gross-Grained Software Evaluation: The Natural Language
Software Registry" reports on the NL Software Registry sponsored by the Association for
Computational Linguistics and recently established at the University of Chicago's Center for
Information and Language Studies. Its purpose is to facilitate the exchange and evaluation of
NLP software, both commercial and noncommercial. A concise and uniform summary of the
33 software items that have been submitted has provided a better understanding of some
current evaluation criteria. This gross-grained approach to evaluation was supplemented by
extensive testing and review of selected software.

Summary/Conclusions

The Workshop achieved its goal of gathering researchers representing a wide range of
NLP-related interests. The papers and active support of the workshop participants attest to
research community awareness that NLP technology, having matured to the point of utility for
certain circumscribed applications, has a growing need for the formulation of evaluation
methodologies.

We were fortunate to have one of the earliest reports on the results of MUC-3. Having taken
the initiative in developing evaluation methodologies, MUC events have played an inestimable
role in the development of criteria and methodology for text processing systems. Comments
from MUC-3 participants reinforce the notion of value in evaluating NLP systems for
stimulating ideas and identifying high payoff processing techniques or problem areas. Results
indicate, for example, that the majority of the high-scoring systems in MUC-3 use robust
parsing. And, when participants were asked to identify the key problem their systems faced,
that is, where their energies would best be focused during the next year (until MUC-4),
"discourse analysis" was the popular response [Iwanska et al., 1991].

In reading the Workshop papers, the relationship between the efforts cataloging specific
NLU capabilities and the optional appositives test described in the second of the two MUC-3
papers should be noted. There may be synergistic benefits to be gained from the insinuation of
such cataloging efforts into MUC-4 or similar evaluation activities of the future.

Many new evaluation projects and activities have begun since the December 1988
Workshop on the Evaluation of NLP Systems held at Wayne, PA [Palmer et al., 1989]. The
response to the call within the report for that workshop for "rigorous accounts" of linguistic
phenomena, such as that offered for discourse by [Webber, 1988], is strongly reflected within
this volume. Also, large collections of text and other resources (such as the Software Registry
described in the last paper here) have been recognized as crucial and cost-effective for future
NLP development and evaluation efforts. Workshop discussion was interspersed with mention
of a number of text collection initiatives (which were more fully described later in the week to
ACL attendees during the business meeting). For example, recognizing the need for large
amounts of linguistic data for speech and text processing research, DARPA is in the process of
establishing the Linguistic Data Consortium (LDC) to develop and distribute speech and text
corpuses, lexicons, and grammars. The goal of the ACL Data Collection Initiative (ACL/DCI) to
acquire large text corpuses with a total of over 100 million words has been surpassed, and the
TREEBANK Project at the University of Pennsylvania operates to provide linguistic analyses
of vast quantities of spoken and written text. Further information on the ACL/DCI, the
TREEBANK Project, and the Text Encoding Initiative (TEI), an international project to
"formulate and disseminate guidelines for the preparation and interchange of
machine-readable texts," can be found in [Walker, 1990].

Workshop attendees commented on the lack of MUC-like evaluation programs for NLG
technology. This deficiency was attributed to insufficient funding support and the lack of an
evaluation methodology. Efforts toward defining feasible measures and testing methods are
being made, however, as evidenced by the mention in the enclosed paper, "Evaluation for
Generation", of recent workshops focused on those goals.

Workshop participants concluded that some important and appropriate steps are being
taken, but much more remains to be done toward developing evaluation methodologies for
NLP. A fact to be kept in mind is that evaluation standards are not created or defined by decree
but must evolve and earn community acceptance over time. The work described in the papers
in these Proceedings is taking some critical steps in that process. Workshop participants, in
particular the authors of this introductory paper, seek feedback from the research community
on their projects and research directions. We encourage reader comments.

Our sincere thanks go to all who participated in the Workshop: the Workshop
speakers/authors; fellow members of the Workshop Organizing Committee, namely, Christine
Montgomery, Tim Finin, and Ralph Grishman; the Association for Computational Linguistics,
particularly Peter Norvig and Don Walker; and the University of California at Berkeley.
INTRODUCTION

The Naval Ocean Systems Center (NOSC) has conducted the third in a series of
evaluations of English text analysis systems. These evaluations are intended to
advance our understanding of the merits of current text analysis techniques, as
applied to the performance of a realistic information extraction task. The latest
one is also intended to provide insight into information retrieval technology
(document retrieval and categorization) used instead of or in concert with
language understanding technology. The inputs to the analysis/extraction process
consist of naturally-occurring texts that were obtained in the form of electronic
messages. The outputs of the process are a set of templates or semantic frames
resembling the contents of a partially formatted database.

The premise on which the evaluations are based is that task-oriented tests
enable straightforward comparisons among systems and provide useful
quantitative data on the state of the art in text understanding. The tests are
designed to treat the systems under evaluation as black boxes and to point up
system performance on discrete aspects of the task as well as on the task overall.
These quantitative data can be interpreted in light of information known about
each system's text analysis techniques in order to yield qualitative insights into the
relative validity of those techniques as applied to the general problem of
information extraction.

This paper presents an overview of the evaluation and its results. A MUC-3
conference proceedings will be published by Morgan Kaufmann that includes the
complete set of overall test scores, some analysis by the participants of their test
results, and system descriptions.

OVERVIEW 

The  third  evaluation  began  in  October,  1990.  A  dry-run  phase  was  completed  in 
February,  1991,  and  final  testing  was  carried  out  in  May,  1991,  concluding  with  the 
Third  Message  Understanding  Conference  (MUC-3).  This  evaluation  was 
significantly  broader  in  scope  than  previous  ones  in  most  respects,  including  text 
characteristics,  task  specifications,  performance  measures,  and  range  of  text 
understanding and information extraction techniques. The corpus and task are
sufficiently challenging that they will be used again (with a new test set) in a
future evaluation, which will seek to measure improvements in performance by
MUC-3 systems and establish performance baselines for any new systems.

A call for participation was sent to organizations in the U.S. that were known to
be engaged in system design or development in the area of text analysis or
information retrieval. Twelve of the sites that responded participated in the dry
run and reported results at a meeting held in February, 1991, which also served as
a forum for resolving issues that affected the test design, scoring, etc. for the final
testing in May. One site dropped out after the dry run, and four new sites entered.
The Defense Advanced Research Projects Agency (DARPA), which funded NOSC to
conduct the evaluation, also provided partial financial support to two-thirds of the
participating sites.

Pure and hybrid systems based on a wide range of text interpretation
techniques (e.g., statistical, key-word, template-driven, pattern-matching,
in-depth natural language processing) were represented in the MUC-3 evaluation.
The fifteen sites that completed the evaluation are Advanced Decision Systems
(Mountain View, CA), BBN Systems and Technologies (Cambridge, MA), General
Electric (Schenectady, NY), GTE (Mountain View, CA), Intelligent Text Processing,
Inc. (Santa Monica, CA), Hughes Research Laboratories (Malibu, CA), Language
Systems, Inc. (Woodland Hills, CA), McDonnell Douglas Electronic Systems (Santa
Ana, CA), New York University (New York City, NY), PRC, Inc. (McLean, VA), SRI
International (Menlo Park, CA), Synchronetics, Inc. together with the University
of Maryland (Baltimore, MD), Unisys Center for Advanced Information Technology
(Paoli, PA), the University of Massachusetts (Amherst, MA), and the University of
Nebraska (Lincoln, NE) in association with the University of Southwest Louisiana
(Lafayette, LA). In addition, an experimental prototype of a probabilistic text
categorization system was developed by David Lewis, who is now at the University
of Chicago, and was tested along with the other systems.

CORPUS  AND  TASK 

The corpus was formed via a keyword query to an electronic database
containing articles in message format that had originated from open sources
worldwide. These articles had been compiled, translated (if necessary), edited, and
disseminated by the Foreign Broadcast Information Service of the U.S. Government.
A training set of 1300 texts was identified, and additional texts were set aside for use
as test data. The corpus presents realistic challenges in terms of its overall size
(over 2.5 megabytes), the length of the individual articles (approximately a
half-page each on average), the variety of text types (newspaper articles, TV and radio
news, speech and interview transcripts, rebel communiques, etc.), the range of
linguistic phenomena represented (both well-formed and ill-formed), and the
open-ended nature of the vocabulary (especially with respect to proper nouns).

The task was to extract information on terrorist incidents (incident type, date,
location, perpetrator, target, instrument, outcome, etc.) from the relevant texts in
a blind test on 100 previously unseen texts. Approximately half of the articles were
irrelevant to the task as it was defined; scoring penalties were exacted for failures
to correctly determine relevancy (see following section). The extracted
information was to be represented in a template in one of several ways, according
to the data format requirements of each template slot. Some slot fills were required
to be categories from a predefined set of possibilities (e.g., for the various types of
terrorist incidents such as BOMBING, ATTEMPTED BOMBING, BOMB THREAT);
others were required to be canonicalized forms (e.g., for dates) or numbers; still
others were to be in the form of strings (e.g., for person names).

A relatively simple article and corresponding answer-key template are shown
in Figures 1 and 2. Note that the text in Figure 1 is all upper case, that the dateline
includes the source of the article ("Inravision Television Cadena 1"), and that the
article is a news report by Jorge Alonso Sierra Valencia. In Figure 2, the left-hand
column contains the slot labels, and the right-hand column contains the correct
answers as defined by NOSC. Slashes mark alternative correct responses (systems
are to generate just one of the possibilities), an asterisk marks slots that are
inapplicable to the incident type being reported, a hyphen marks a slot for which
the text provides no fill, and a colon introduces the cross-reference portion of a fill
(except for slot 16, where the colon is used as a separator between more general
and more specific place names).
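
The answer-key notation just described can be made concrete with a small Python sketch
that reads a single fill. The function names, the returned dictionary layout, and the
status labels are assumptions introduced for illustration; this is not the MUC-3 scoring
program, and it deliberately ignores the parenthesized alternatives used in slot 11 as
well as fills whose quoted strings themselves contain a slash or colon.

```python
from typing import Dict, List, Union

ParsedFill = Dict[str, Union[str, List[str]]]


def _split_alternatives(text: str) -> List[str]:
    """Split on slashes and drop surrounding whitespace and double quotes."""
    return [part.strip().strip('"') for part in text.split("/") if part.strip()]


def parse_fill(raw: str, slot_number: int) -> ParsedFill:
    """Interpret one answer-key fill using the notation described in the text."""
    text = raw.strip()
    if text == "*":  # slot inapplicable to the incident type
        return {"status": "inapplicable", "alternatives": [], "xref": []}
    if text == "-":  # the article provides no fill for this slot
        return {"status": "no_fill", "alternatives": [], "xref": []}
    xref_text = ""
    # In slot 16 the colon separates general from specific place names,
    # so it is left inside the fill instead of starting a cross-reference.
    if slot_number != 16 and ":" in text:
        text, xref_text = text.split(":", 1)
    return {
        "status": "filled",
        "alternatives": _split_alternatives(text),
        "xref": _split_alternatives(xref_text),
    }


if __name__ == "__main__":
    print(parse_fill('CLAIMED OR ADMITTED: "THE EXTRADITABLES" / "EXTRADITABLES"', 7))
    print(parse_fill("COLOMBIA: MEDELLIN (CITY)", 16))
```

A scorer built on such a structure would accept a system response that matches any one of
the listed alternatives, since systems are to generate just one of the possibilities.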

The participants collectively created the answer key for the training set, each
site manually filling in templates for a partially overlapping subset of the texts.
This task was carried out at the start of the evaluation; it therefore provided
participants with good training on the task requirements and provided NOSC with
good early feedback. Generating and cross-checking the templates required an
investment of at least two person-weeks of effort per site. These answer keys were
updated a number of times to reduce errors and to maintain currency with
changing template fill specifications. In addition to generating answer key
templates, sites were also responsible for compiling a list of the place names that
appeared in their set of texts; NOSC then merged these lists to create the options for
filling the TARGET: FOREIGN NATION(S) slot and LOCATION OF INCIDENT slot.


TST1-MUC3-0080

BOGOTA, 3 APR 90 (INRAVISION TELEVISION CADENA 1) - [REPORT] [JORGE ALONSO
SIERRA VALENCIA] [TEXT] LIBERAL SENATOR FEDERICO ESTRADA VELEZ WAS
KIDNAPPED ON 3 APRIL AT THE CORNER OF 60TH AND 48TH STREETS IN WESTERN
MEDELLIN, ONLY 100 METERS FROM A METROPOLITAN POLICE CAI [IMMEDIATE
ATTENTION CENTER]. THE ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER HAD
LEFT HIS HOUSE WITHOUT ANY BODYGUARDS ONLY MINUTES EARLIER. AS HE WAITED
FOR THE TRAFFIC LIGHT TO CHANGE, THREE HEAVILY ARMED MEN FORCED HIM TO GET
OUT OF HIS CAR AND GET INTO A BLUE RENAULT.

HOURS LATER, THROUGH ANONYMOUS TELEPHONE CALLS TO THE METROPOLITAN
POLICE AND TO THE MEDIA, THE EXTRADITABLES CLAIMED RESPONSIBILITY FOR THE
KIDNAPPING. IN THE CALLS, THEY ANNOUNCED THAT THEY WILL RELEASE THE
SENATOR WITH A NEW MESSAGE FOR THE NATIONAL GOVERNMENT.

LAST WEEK, FEDERICO ESTRADA VELEZ HAD REJECTED TALKS BETWEEN THE
GOVERNMENT AND THE DRUG TRAFFICKERS.


Figure 1. Article from MUC-3 Corpus¹


¹This article has serial number PA0404072690 in the Latin America volume of the Foreign
Broadcast Information Service Daily Reports, which are the secondary source for all the texts in
the MUC-3 corpus.
0.  MESSAGE ID                      TST1-MUC3-0080
1.  TEMPLATE ID                     1
2.  DATE OF INCIDENT                03 APR 90
3.  TYPE OF INCIDENT                KIDNAPPING
4.  CATEGORY OF INCIDENT            TERRORIST ACT
5.  PERPETRATOR: ID OF INDIV(S)     "THREE HEAVILY ARMED MEN"
6.  PERPETRATOR: ID OF ORG(S)       "THE EXTRADITABLES" / "EXTRADITABLES"
7.  PERPETRATOR: CONFIDENCE         CLAIMED OR ADMITTED: "THE EXTRADITABLES" /
                                    "EXTRADITABLES"
8.  PHYSICAL TARGET: ID(S)
9.  PHYSICAL TARGET: TOTAL NUM      *
10. PHYSICAL TARGET: TYPE(S)        *
11. HUMAN TARGET: ID(S)             "FEDERICO ESTRADA VELEZ" ("LIBERAL SENATOR" /
                                    "ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER"
                                    / "SENATOR" / "LIBERAL PARTY LEADER" / "PARTY
                                    LEADER")
12. HUMAN TARGET: TOTAL NUM         1
13. HUMAN TARGET: TYPE(S)           GOVERNMENT OFFICIAL / POLITICAL FIGURE:
                                    "FEDERICO ESTRADA VELEZ"
14. TARGET: FOREIGN NATION(S)       -
15. INSTRUMENT: TYPE(S)
16. LOCATION OF INCIDENT            COLOMBIA: MEDELLIN (CITY)
17. EFFECT ON PHYSICAL TARGET(S)    *
18. EFFECT ON HUMAN TARGET(S)       -

Figure  2.  Answer  Key  Template 

MEASURES  OF  PERFORMANCE 

All systems were evaluated on the basis of performance on the information
extraction task in a blind test at the end of each phase of the evaluation. It was
expected that the degree of success achieved by the different techniques in May
would depend on such factors as whether the number of possible slot fillers was
small, finite, or open-ended and whether the slot could typically be filled by fairly
straightforward extraction or not. System characteristics such as amount of
domain coverage, degree of robustness, and general ability to make proper use of
information found in novel input were also expected to be major factors. The
dry-run test results were not assumed to provide a good basis for estimating
performance on the final test in May, but the expectation was that most, if not all,
of the systems that participated in the dry run would show dramatic improvements
in performance. The test results show that some of these expectations were borne
out, while others were not or were less significant than expected.

A semi-automated scoring program was developed under contract for MUC-3 to
enable the calculation of the various measures of performance. It was distributed
to participants early on during the evaluation and proved invaluable in providing
them with the performance feedback necessary to prioritize and reprioritize their
development efforts as they went along. The scoring program can be set up to
score all the templates that the system generates or to score subsets of
templates/slots. User interaction is required only to determine whether a
mismatch between the system-generated templates and the answer-key templates
should be judged completely or partially correct. (A partially correct filler for slot
11 in Figure 2 might be "VELEZ" ("LEADER"), and a partially correct filler for
slot 16 would be simply COLOMBIA.) An extensive set of interactive scoring
guidelines was developed to standardize the interactive scoring. The scoring
program maintains a log of interactions that can be used in later scoring runs and
augmented by the user as the system is updated and the system-generated templates
change.

The two primary measures of performance were completeness (recall) and
accuracy (precision). There were two additional measures, one to isolate the
amount of spurious data generated (overgeneration) and the other to determine
the rate of incorrect generation as a function of the number of opportunities to
incorrectly generate (fallout). The labels "recall," "precision," and "fallout" were
borrowed from the field of information retrieval, but the definitions of those terms
had to be substantially modified to suit the template-generation task. The
overgeneration metric has no correlate in the information retrieval field, i.e., a
MUC-3 system can generate indefinitely more data than is actually called for, but
an information retrieval system cannot retrieve more than the total number of
items (e.g., documents) that are actually present in the corpus.

Fallout can be calculated only for those slots whose fillers form a closed set.
Scores for the other three measures were calculated for the test set overall, with
breakdowns by template slot. Figure 3 presents a somewhat simplified set of
definitions for the measures.


MEASURE           DEFINITION

RECALL            #correct fills generated / #fills in key

PRECISION         #correct fills generated / #fills generated

OVERGENERATION    #spurious fills generated / #fills generated

FALLOUT           #incorrect + spurious fills generated / #possible incorrect fills

Figure  3.  MUC-3  Scoring  Metrics 

The most significant thing that this table does not show is that precision and recall
are actually calculated on the basis of points - the term "correct" includes system
responses that matched the key exactly (earning 1 point each) and system
responses that were judged to be a good partial match (earning .5 point each). It
should also be noted that overgeneration is not only a measure in its own right but
is also a component of precision, where it acts as a penalty by contributing to the
denominator. Overgeneration also figures in fallout by contributing to the
numerator. Further information on the MUC-3 evaluation metrics, including
information on three different ways penalties for missing and spurious data were
assigned, can be found elsewhere in this volume in the paper by Nancy Chinchor.

TEST  PROCEDURE 

Final testing was done on a test set of 100 previously unseen texts that were
representative of the corpus as a whole. Participants were asked to copy the test
package electronically to their own sites when they were ready to begin testing.
The testing had to be conducted and the results submitted within a week of the date
when the test package was made available for electronic transfer. Each site
submitted its system-generated templates, the outputs of the scoring program
(score reports and the interactive scoring history file), and a trace of the system's
processing (whatever type of trace the system normally produces that could serve
to help validate the system's outputs). Initial scoring was done at the individual
sites, with someone designated as interactive scorer who preferably had not been
part of the system development team. After the conference, the system-generated
templates for all sites were labeled anonymously and rescored by two volunteers in
order to ensure that the official scores were obtained as consistently as possible.

The system at each site was to be frozen before the test package was
transferred; no updates were permitted to the system until testing and scoring
were completed. Furthermore, no backing up was permitted during testing in the
event of a system error. In such a situation, processing was to be aborted and
restarted with the next text. A few sites encountered unforeseen system problems
that were easily pinpointed and fixed. They reported unofficial, revised test results
at the conference that were generally similar to the official test results and do not
alter the overall picture of the current state of the art.

The basic test called for systems to be set up to generate templates that
produced the "maximum tradeoff" between recall and precision, i.e., templates that
achieved scores as high as possible and as similar as possible on both recall and
precision. This was the normal mode of operation for most systems and for many
was the only mode of operation that the developers had tried. Those sites that could
offer alternative tradeoffs were invited to do so, provided they notified NOSC in
advance of the particular setups they intended to test on.

In addition to the scores obtained for these metrics on the basic
template-generation task, scores were obtained of system performance on the linguistic
phenomenon of apposition, as measured by the template fills generated by the
systems in particular sets of instances. That is, sentences exemplifying apposition
were marked for separate scoring if successful handling of the phenomenon
seemed to be required in order to fill one or more template slots correctly for that
sentence. This test was conducted as an experiment and is described in the paper
by Nancy Chinchor elsewhere in this volume.

TEST  RESULTS  AND  DISCUSSION 

Scatter plots for selected portions of the final test results are shown in the
appendix. The data points are labeled with abbreviated names of the 15 sites, and
optional test runs are marked with the site's name and an "O" extension. The plots
present an interesting picture of the MUC-3 results as a whole, but the significance
of the numbers for each of the tested systems needs to be assessed on the basis of a
careful reading of the MUC-3 proceedings papers that were submitted by each of
the sites. The level of effort that could be afforded by each of the sites varied
considerably, as did the maturity of the systems at the start of the evaluation. All
sites were operating under time constraints imposed by the evaluation schedule.
In addition, the evaluation demands were a consequence of the intricacies of the
task and of general corpus characteristics such as the following:

•  The texts that are relevant to the MUC-3 task (comprising approximately 50%
of the total corpus) are likely to contain more than one relevant incident.

•  The information on a relevant incident may be dispersed throughout the text
and may be intertwined with accounts of other (relevant or irrelevant) incidents.

•  The corpus includes a mixture of material (newspaper articles, TV news,
speeches, interviews, propaganda, etc.) with varying text structures and styles.

The scoring program produces four sets of overall scores, three of which are
based on different means of assessing penalties for missing and spurious data. The
set called Matched/Missing is a compromise between two others and is used as the
official one for reporting purposes. Figure A1 is based on the Matched/Missing
method of assessing penalties. The fourth method does the scoring only for those
slots that require set fills, i.e., fills that come from predefined sets of categories.
Figure A2 is based on that method of scoring. The various methods are described
more fully in the paper by Nancy Chinchor.

Figure A1 gives the most general picture of the results of MUC-3 final testing.
It shows that precision always exceeds recall and that the systems with relatively
high recall are also the ones that have relatively high precision. The latter fact
inspires an optimistic attitude toward the promise of at least some of the techniques
employed by today's systems - further efforts to enhance existing techniques and
extend the systems' domain coverage may lead to significantly improved
performance on both measures. However, since all systems show better precision
than recall, it appears that it will be a bigger challenge to obtain very high recall
than it will be to achieve higher precision at recall levels that are similar to those
achievable today.

The distribution of data points tentatively supports at least one general
observation about the technologies that underlie today's systems: those systems
that use purely stochastic techniques or handcrafted pattern-matching techniques
were not able to achieve the same level of performance for MUC-3 as some of the
systems that used parsing techniques. The "non-parsing" systems are ADS, HU,
MDC, UNI, UNL, UNL-O1, and UNL-O2, and the "parsing" systems are BBN, BBN-O, GE,
GTE, ITP, LSI, NYU, NYU-O1, NYU-O2, PRC, SRI, SYN, UMA, and UMA-O.

Further support for this observation can be found in Figure A2, where the
scores are computed for all slots requiring set fills, and in Figure A3, which shows
the scores for just one of those set-fill slots. In these cases, one might expect the
non-parsing systems to compare more favorably with the parsing systems, since
the fill options are restricted to a fairly small, predefined set of possibilities.
However, none of the non-parsing systems appears at the leading edge in Figure
A2, and the only non-parsing system in the cluster at the leading edge in Figure A3
is ADS (which shares a data point with NYU-O2), although a few non-parsing
systems have extremely high precision scores (UNI, UNL, UNL-O1, and UNL-O2).

On the other hand, there is quite a range in performance even among the
systems in the parsing group, all of which had to cope with having limited
coverage of the domain. One thing that is apparent from the sites' system
descriptions is that all the ones on the leading edge in Figure A1 have the ability to
make good use of partial sentence parses when complete parses cannot be obtained.
Level of effort is also an indicator of performance success, though not a completely
reliable one: GE, NYU, and UMass all reported investing more than one
person-year of effort on MUC-3, but several other sites with lower overall performance
also reported just under or over one person-year of effort.

It must be said that there were some extremely immature systems in the
non-parsing group and the parsing group alike, so any general conclusions must be
taken as tentative and should certainly not be used to form opinions about the
relative validity of isolated techniques employed by the individual systems in each
group. It could be that the relatively low-performing systems use extremely
effective techniques that, if supplemented by other known techniques or
supported by more extensive domain coverage, would put the system well out in
front. One should also not assume that the systems at the leading edge are similar
kinds of systems. In fact, those systems have quite different architectures and
have varying sizes of lexicons, kinds of parsers and semantic interpreters, etc.

In addition to showing how system performance varies from one slot to
another, Figures A3, A4 and A5 show how spurious data generation combines with
incorrect data generation to affect the precision scores in different kinds of slots.
Figure A4 is for the TEMPLATE ID slot. The fillers of this slot are arbitrary
numbers that uniquely identify the templates for a given message. The scoring
program disregards the actual values and finds the best match between the
system-generated templates and the answer key templates for a given message based on
the degree of match in fillers of other slots in the template. Since there is no such
thing as an incorrect template ID, only a spurious or missing template, and since
missing data plays no role at all in computing precision, the only penalty to
precision for the TEMPLATE ID slot is due to spurious data generation. In contrast
to the TEMPLATE ID slot, the TYPE OF INCIDENT slot (Figure A3) shows no
influence of spurious data on precision at all. This is because the TYPE OF
INCIDENT slot permits only one filler. The HUMAN TARGET: ID(S) slot (Figure
A5) can be filled with indefinitely many fillers and thus shows the impact of both
incorrect and spurious data on precision.
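
The template-matching step described above can be illustrated with a small sketch. It is not the MUC-3 scoring program: the function names, the dictionary representation of a template, and the use of exact filler agreement as the "degree of match" are all assumptions made for illustration, and the brute-force search is only practical for the handful of templates a single message produces.

```python
from itertools import permutations

def slot_overlap(response, key):
    """Count slots, other than TEMPLATE ID, whose fillers agree exactly.

    Assumption: templates are plain dicts mapping slot name -> filler.
    """
    return sum(
        1
        for slot, filler in response.items()
        if slot != "TEMPLATE ID" and key.get(slot) == filler
    )

def best_alignment(responses, keys):
    """Pair response templates with key templates to maximize total overlap.

    Returns (score, pairs), where pairs holds (response_index, key_index).
    Responses left unpaired would count as spurious templates, and keys
    left unpaired as missing templates.
    """
    size = max(len(responses), len(keys))
    # Pad the key side with None so every response gets a partner or nothing.
    padded = list(range(len(keys))) + [None] * (size - len(keys))
    best_score, best_pairs = -1, []
    for perm in permutations(padded):
        pairs = [(r, k) for r, k in zip(range(len(responses)), perm) if k is not None]
        score = sum(slot_overlap(responses[r], keys[k]) for r, k in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs
```

Because the TEMPLATE ID values are ignored during matching, a system that numbers its templates differently from the key is not penalized for the numbering itself, only for templates that find no partner.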

Four sites submitted results for the optional test runs mentioned in the
previous section: BBN Systems and Technologies (BBN-O), New York University
(NYU-O1 and NYU-O2), the University of Massachusetts (UMA-O), and the
University of Nebraska/University of Southwestern Louisiana (UNL-O1 and
UNL-O2). These sites conducted radically different experiments to generate templates
more conservatively. The BBN-O experiment largely involved doing a narrower
search in the text for the template-filling information; the NYU-O1 and NYU-O2
experiments involved throwing out templates in which certain key slots were
either unfilled or were filled with information that indicated an irrelevant
incident with good probability; the UMA-O experiment bypassed a case-based
reasoning component of the system; and the UNL-O1 and UNL-O2 experiments
involved the usage of different thresholds in their connectionist framework. The
experiments resulted in predicted differences in the Matched/Missing scores
compared to the basic test. In almost all cases the experiments had the overall
effect of lowering recall; in all cases they lowered overgeneration and thereby
raised precision. Figure A4 shows the marked difference the experiments made in
spurious template generation; Figure A1 shows the much smaller difference they
made in overall recall and precision.

CONCLUSIONS


The  MUC-3  evaluation  established  a  solid  set  of  performance  benchmarks  for 
systems  with  diverse  approaches  to  text  analysis  and  information  extraction.  The 
MUC-3  task  was  extremely  challenging,  and  the  results  show  what  can  be  done  with 
today's  technologies  after  only  a  modest  domain-  and  task-specific  development 
effort  (on  the  order  of  one  person-year).  On  a  task  this  difficult,  the  systems  that 
cluster  at  the  leading  edge  were  able  to  generate  in  the  neighborhood  of  40-50%  of 
the  expected  data  and  to  do  it  with  55-65%  accuracy.  Breakdowns  of  performance 
by  slot  show  that  performance  was  best  on  identifying  the  type  of  incident  -  70- 
80%  completeness  and  80-85%  accuracy  were  achieved,  and  accuracy  figures  in  the 
90-100%  range  were  possible  with  some  sacrifice  in  completeness. 

All of the MUC-3 system developers are optimistic about the prospects for
seeing steady improvements in system performance for the foreseeable future.
This feeling is based variously on such evidence as the amount of improvement
achieved between the dry-run test and the final test, the slope of improvement
recorded on internal tests conducted at intervals during development, and the
developers' own awareness of significant components of the system that they had
not had time to adapt to the MUC-3 task. The final test results are consistent with
the claim that most systems, if not all, may well be still on a steep slope of
improvement. However, they also show that performance on recall (completeness)
is not as good as performance on precision (accuracy), and they lend support to the
possibility that this discrepancy will persist. It appears that systems cannot be
built today that are capable of obtaining high overall recall, even at the expense of
outrageously high overgeneration. Systems can, however, be built that will do a
good job at potentially useful subtasks such as identifying terrorist incidents of
various kinds.

The results give at least a tentative indication that systems incorporating
robust parsing techniques show more long-term promise of high performance
than  non-parsing  systems.  However,  there  are  great  differences  in  techniques 
among  the  systems  in  the  parsing  and  non-parsing  groups  and  even  among  those 
robust  parsing  systems  that  did  the  best  in  optimizing  the  overall  tradeoff  between 
recall  and  precision.  Further  variety  was  evident  in  the  optional  test  runs 
conducted  by  some  of  the  sites.  Those  runs  show  promise  for  the  development  of 
systems  that  can  be  "tuned"  in  various  ways  to  generate  data  more  aggressively  or 
more  conservatively,  yielding  tradeoffs  between  recall  and  precision  that  respond 
to  differences  in  emphasis  in  real-life  applications. 

APPENDIX

Figure A4. Recall vs Precision for Template ID Slot

Figure A5. Recall vs Precision for Human Target ID Slot
(Some systems did not attempt to fill this slot and therefore do not appear in the figure.)

This presentation describes the development of the evaluation metrics and the
linguistic phenomena tests for the DARPA-sponsored Third Message Understanding
Conference (MUC-3). The systems participating in the conference were evaluated for
their performance on a specific data extraction task. The systems represented a
variety of approaches to the problem of data extraction, many of which include
natural language processing. The task for evaluation was based on news reports of
potential terrorist activities, and the task design is discussed in detail in a separate
presentation by Beth Sundheim of the Naval Ocean Systems Center.

The message understanding systems participating in the MUC-3 evaluation
produce filled database templates for test messages. The template fills were scored
semi-automatically against a human-produced key by software developed especially
for MUC-3. The scoring algorithm implemented in that software was designed based
on initial consultations among members of the Program Committee, criteria
determined during the process of fully specifying the overall evaluation metrics for
a dry run of the testing procedure (February 1991), and discussions following the dry
run. This presentation describes the evaluation metrics, the rationale behind them,
and their utility in the final evaluation.

In addition to these official overall metrics, linguistic phenomena tests were
also run for each of the systems. These tests have been designed in two phases, with
an analysis of the linguistic phenomena test experiment concerning the validity of
the tests completed in the second phase (May 1991). This presentation discusses the
development of the linguistic phenomena tests and the results of the linguistic
phenomena test experiment.

EVALUATION  METRICS 

The evaluation metrics for MUC-3 were based on tallying raw scores for the
template slot fills given by the system. The templates were identified by the message
identifier and the template number for that message. Templates could be generated
in any order for a particular message, and the scoring system would map them to the
key in a manner which optimized the score. Participants could remap the templates
by hand, and a history of that remapping would be kept for the official record. The
other slots pertaining to the incident contained two types of fills: those that come
from a finite specified list of fills and those that are string fills that essentially come
from an infinite set or a set with an indeterminable cardinality. These two types of
slots are treated differently in the scoring.

Tallies for both kinds of slots were kept as to whether the fills were correct,
incorrect, noncommittal, partially correct, spurious, or missing (see Figure 1). A
system response was considered correct if it exactly matched the answer key. It was
considered incorrect if it did not match the answer key. Possible mismatches caused
the semi-automated scoring software to prompt the user for information as to the
correctness of the mismatched slot filler. In the case of set fills, the most credit the
user could give was partial credit. In the case of string fills, full credit could be
given. Noncommittal slot fills were those for which both the answer key and the
system response were blank. A system response was spurious if the answer key was
blank and the system had filled the slot. The system response was scored as missing if
it was blank where the answer key had a fill.
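
As a rough illustration of the tallying rules just described, the sketch below classifies a single slot. The function name, the use of `None` for a blank slot, and the `judgment` argument standing in for the interactive scorer's answer are all hypothetical; the actual scoring software prompted a person at the mismatch step.

```python
def tally_category(response, key, judgment="incorrect", set_fill=False):
    """Classify one slot fill into a tally category.

    response, key: the system's filler and the answer-key filler, with None
        meaning the slot is blank (an assumption of this sketch).
    judgment: the human scorer's ruling on a mismatch, one of
        "correct", "partial", or "incorrect".
    set_fill: True when the slot takes fills from a finite list; in that
        case a mismatch can earn at most partial credit.
    """
    if key is None and response is None:
        return "noncommittal"
    if key is None:
        return "spurious"
    if response is None:
        return "missing"
    if response == key:
        return "correct"
    # Mismatch: defer to the scorer, capping set fills at partial credit.
    if set_fill and judgment == "correct":
        return "partial"
    return judgment
```

For instance, a string-fill response of "VELEZ" against a key of "FEDERICO ESTRADA VELEZ" would be tallied according to whatever the scorer decides, while a blank response against the same key is always tallied as missing.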


                           * * * TOTAL SLOT SCORES * * *

SLOT                  POS  ACT |  COR PAR INC | ICR IPA | SPU MIS NON | REC PRE OVG FAL

template-id           118  115 |  114   0   0 |   0   0 |   1   4  39 |  97  99   1
incident-date         114  110 |   90  10  10 |   3  10 |   0   4   4 |  83  86   0
incident-type         118  114 |  112   1   1 |   0   1 |   0   4   0 |  95  99   0   0
category               90  109 |   88   0   0 |   0   0 |  21   2   7 |  98  81  19  14
indiv-perps           106   61 |   59   0   2 |  10   0 |   0  45  50 |  56  97   0
org-perps              71   68 |   58   0   1 |  15   0 |   9  12  48 |  82  85  13
perp-confidence        71   68 |   56   1   2 |  12   1 |   9  12  48 |  80  83  13   2
phys-target-ids        59   57 |   54   3   0 |  14   3 |   0   2  77 |  94  97   0
phys-target-num        41   41 |   39   0   2 |   0   0 |   0   0  77 |  95  95   0
phys-target-types      59   57 |   52   4   1 |  11   4 |   0   2  77 |  92  95   0   0
human-target-ids      145  131 |  129   2   0 |  33   2 |   2  14  23 |  90  99   2
human-target-num       94   88 |   79   6   2 |   0   6 |   1   7  23 |  87  93   1
human-target-types    145  131 |  126   2   3 |  24   2 |   2  14  23 |  88  97   2   0
target-nationality     35   19 |   17   2   0 |   3   2 |   0  16 103 |  51  95   0   0
instrument-types       25   22 |   16   1   0 |   0   0 |   5   8  88 |  66  75  23   0
incident-location     118  113 |   88  24   1 |   0   1 |   0   5   0 |  85  88   0
phys-effects           41   44 |   37   3   0 |   8   3 |   4   1  89 |  94  88   9   0
human-effects          56   54 |   43   2   2 |  10   2 |   8   9  81 |  79  81  15   1

MATCHED ONLY         1464 1402 | 1257  61  27 | 171  37 |  62 119 826 |  88  92   4
MATCHED/MISSING      1506 1402 | 1257  61  27 | 171  37 |  62 161 857 |  85  92   4
ALL TEMPLATES        1506 1420 | 1257  61  27 | 171  37 |  80 161 861 |  85  91   6
SET FILLS ONLY        640  618 |  547  16   9 |  68  15 |  49  68 516 |  87  90   8   0

Figure 1: Summary Score Report


The four evaluation metrics used in MUC-3 were recall, precision, fallout, and
overgeneration. These four metrics were defined in terms of the categories
mentioned above (see Figure 2). Recall was the sum of the points for actual attempts
divided by the total possible points. The numerator was the sum of the number of
correct answers and 0.5 times the number of partially correct answers. The
denominator was the number of slot fillers in the answer key. Recall measured the
completeness with which the system extracted data. It was a measure of the amount
of relevant data the system put in the templates relative to the total amount of data
that should have been put in the templates. Recall was calculated for each slot and
over all slots. Recall was an important metric in the evaluation of the message
understanding systems because of the importance of completeness in the extraction
of data.

recall = (correct + (partial x 0.5)) / possible

    where possible = number of required slot fillers in key
                     + number of matched optional values

precision = (correct + (partial x 0.5)) / actual

    where actual = number of slot fillers in response

overgeneration = spurious / actual

fallout = (incorrect + spurious) / possible incorrect

    where possible incorrect = number of possible incorrect answers
                               which could be given in response

Figure  2:  Calculation  of  Metrics 


Precision was a measure of the accuracy of the system's answers. It was
calculated by dividing the sum of points for all actual attempts by the total possible
points if all actual attempts were correct. The numerator was the sum of the number
of the correct answers and 0.5 times the number of partially correct answers. The
denominator was the sum of the number of correct, partially correct, incorrect, and
spurious answers generated by the system. The number of spurious slot fillers
generated by the system affected the overall precision because it was necessary to
penalize systems that overgenerate slot fillers in an attempt to maximize their
score. Precision was a measure of the amount of relevant data the system put into the
templates relative to the total amount of the data the system put into the templates. It
was the tendency of the system to avoid assigning bad fillers as it assigned more good
fillers. If overgeneration tends to be proportional to correct generation, then
precision is a measure of overgeneration. Precision was calculated for individual
slots as well as for all slots. Precision, like recall, was an important metric in the
evaluation of message understanding systems because it indicated how well the
system performed on the fillers it actually generated.

Fallout measured the tendency of a system to assign more incorrect fillers as
the number of potential incorrect fillers increased. Fallout was calculated by
dividing the number of incorrectly given fillers, i.e., the number of incorrect and
spurious fillers given for the slot, by the number of possible incorrect fillers.
Because of its dependence on the cardinality of the set of fillers available for a slot,
fallout could only be measured for those slots in the MUC-3 templates which were
filled from finite sets. No global fallout score could be calculated due to the fact that
not all slots were filled from finite sets. A partial global score was calculated for all
slots that were filled from finite sets. Fallout was a measure of overgeneration if
overgeneration was proportional to the number of opportunities to overgenerate.
Fallout was only somewhat important in the evaluation because it was an attempt to
measure false positives for a task which did not lend itself well to this sort of
measurement due to its open-ended nature.

The "overgeneration" measure was a measure of the number of spurious fillers
assigned  in  relation  to  the  total  assigned.  It  was  calculated  by  dividing  the  number  of 
spurious  slot  fillers  by  the  number  of  slot  fillers  given.  Overgeneration  was 
calculated  for  individual  slots  as  well  as  all  slots.  The  overgeneration  measure  was  an 
important  part  of  the  final  analysis  of  the  systems  because  it  isolated  a  key  aspect  of 
precision  and  shed  further  light  on  the  trade-off  between  overgeneration  and  recall. 
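
Fallout and overgeneration can be sketched the same way. The function names are hypothetical, the number of possible incorrect fillers is taken as an input because the text does not say how it was counted, and "slot fillers given" is assumed to be the same total used as the precision denominator.

```python
from typing import Optional


def fallout(incorrect: int, spurious: int, possible_incorrect: int) -> Optional[float]:
    """Incorrectly given fillers divided by the number of possible incorrect fillers.
    Returns None when the slot is not filled from a finite set, signalled here
    (as an assumption) by possible_incorrect == 0."""
    if possible_incorrect == 0:
        return None
    return (incorrect + spurious) / possible_incorrect


def overgeneration(correct: int, partial: int, incorrect: int, spurious: int) -> float:
    """Spurious fillers divided by all fillers the system gave."""
    given = correct + partial + incorrect + spurious
    if given == 0:
        return 0.0  # assumption: nothing given means nothing overgenerated
    return spurious / given


# Hypothetical tallies for one set-fill slot with 8 possible incorrect fillers.
print(fallout(incorrect=1, spurious=1, possible_incorrect=8))
print(overgeneration(correct=6, partial=2, incorrect=1, spurious=1))
```

The two measures share a numerator term but differ in what they are normalized by: fallout by the opportunities to be wrong, overgeneration by what the system actually produced.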

In the discussions following the dry run, it was decided that four summary
rows of metrics would be included in the score report. The four rows would be
calculated  based  on  different  ways  of  tallying  the  raw  scores  that  go  into  the 
calculation  of  the  metrics.  Three  of  the  rows  treated  spurious  and  missing  templates 
with  varying  levels  of  strictness  depending  on  whether  the  spurious  and  missing  slot 
fills  in  those  templates  were  scored  or  whether  just  the  template  id  slot  was  penalized 
for  the  error. 

The first row, called "matched only," scored spurious and missing templates
only in the template id slot. The second row, called "matched/missing," scored
missing slot fills as missing for all the affected slots. This row was used as the official
score for the final MUC-3 test run and was the only summary row in the dry run. The
third row, called "all templates," scored spurious and missing slot fills in all the slot
rows affected. This row represented the strictest scores.

A  fourth  and  final  summary  row  was  also  given  for  "set  fills  only."  This  row 
represented  how  the  systems  did  for  the  slots  whose  fills  came  from  finite  sets.  A 
global fallout score was calculated from the totals in this summary row in an attempt
to get a "false alarm" rate for the systems. However, this global fallout score is not
representative of the task because of the considerable number of important string
fill slots not included. The effort to measure the false alarm rate has been hampered
because the number of possible incorrect fills cannot be determined for slots whose fills
come  from  a  potentially  infinite  set. 

The  variety  of  summary  scores  was  provided  both  to  determine  the  appropriate 
way  to  score  the  spurious  and  missing  templates  and  to  provide  an  indication  of  how 
well  systems  would  do  for  applications  with  varying  requirements.  Applications  may 
have  differing  tolerances  for  the  amount  of  data  missing  from  the  generated 
database  and  the  amount  of  spurious  data  entered  into  the  database.  An  analysis  of 
the final results of MUC-3 was done by plotting recall versus precision,
overgeneration  versus  recall,  and  overgeneration  versus  precision.  The  results 
indicate  that  the  "all  templates"  score  reflects  the  expected  effect  of  overgeneration 
on  the  plots. 

A more extensive report on the evaluation metrics entitled "MUC-3 Evaluation
Metrics" appears in the proceedings of the Third Message Understanding Conference
(MUC-3) published by DARPA.

LINGUISTIC  PHENOMENA  TESTING 

Linguistic  phenomena  testing  can  supplement  the  overall  evaluation  of  data 
extraction  systems  because  the  tests  measure  performance  on  a  set  of  characteristics 
of  the  input  critical  to  the  output.  For  the  dry  run  held  in  February,  there  were 
three  main  linguistic  phenomena  tests  devised  representing  increasing  frequency 
of the phenomena in the test messages. The first tested for processing of negatives
in filling two slots in the template fill task. The results of the dry run and the
information from participants strongly suggested that negation should be tested
more broadly in terms of slots filled and in terms of linguistic constructions
considered, if possible. The second linguistic phenomena test was for the processing
of conjunctions in filling two slots. The dry run showed that this was a useful test
with regard to the task and the systems. Conjunction was more frequent than
negation and was frequent enough to be representative. The third linguistic
phenomena test was constructed around the terrorist incident type verb forms
(active  versus  passive)  and  the  kind  of  clauses  in  which  they  appeared  (main  versus 
subordinate).  The  intent  was  to  determine  if  the  systems  performed  differently  on 
slots  that  required  processing  of  active  versus  passive  verb  forms  and  main  versus 
subordinate  clauses.  The  preliminary  results  showed  that  the  test  results  for  this 
third  test  coincided  with  overall  system  performance  and  did  not  necessarily  reveal 
new  information  about  linguistic  processing  capabilities.  This  test  was  also  the  most 
difficult  of  the  three  to  devise  because  of  the  process  required  to  determine  which 
slots  reflected  the  linguistic  processing  we  were  testing.  We  limited  the  scoring  to 
slot  fillers  appearing  in  the  same  sentence  as  the  verb  and  not  appearing  anywhere 
else  in  the  message. 

In  general,  it  was  decided  as  a  result  of  the  dry  run  that  the  linguistic 
phenomena  tests  would  only  begin  to  be  meaningful  once  the  systems  were 
performing  at  a  higher  level.  This  threshold  of  performance  may  be  reached  in 
MUC-3  or  may  not  be  reached  until  MUC-4.  However,  it  was  also  decided  that  the 
development  of  linguistic  phenomena  tests  was  to  proceed  in  parallel  with  the 
development  of  the  systems.  The  validity  of  the  phenomena  testing  became  the  focus 
of an experiment run in the second and final phase of MUC-3 in May 1991.

The  experiment  was  designed  to  determine  if  linguistic  phenomena  could  be 
isolated.  All  of  the  phenomena  tests  for  MUC-3  were  scored  using  the  MUC-3  scoring 
system.  The  experiment  was  run  for  the  phenomenon  of  apposition  of  noun  phrases, 
for  example,  "David  Lecky,  Director  of  the  Columbus  school."  One  or  more  appositives 
containing  information  critical  to  the  template  fill  task  appeared  in  approximately  60 
sentences  in  the  test  corpus.  This  frequency  was  higher  than  the  required 
frequency  of  20  indicated  by  the  dry  run. 

The  appositives  were  scored  in  several  ways  to  determine  the  validity  of  the 
testing.  The  scores  on  slots  filled  from  phrases  and  sentences  containing  appositive 
constructions were different from the overall scores for the systems, indicating that
it was possible that the phenomenon was being isolated. The scores for slots filled
from  phrases  and  sentences  were  similar  indicating  that  future  phenomena  tests 
could  be  designed  based  on  information  in  the  entire  sentence  containing  the 
phenomenon.  The  appositives  were  subjectively  divided  into  subsets  based  on 
complexity  prior  to  testing.  The  systems  scored  consistently  higher  on  the  simpler 
set than they did on the more complex set. This result provided more confidence that
the  phenomenon  was  being  isolated. 

The appositives were also divided according to whether they were postposed or
preposed. A postposed appositive from the test corpus was "Jose Parada Grandy, the
Bolivian Police Chief." "Rede Golobo journalist Carlos Marcelo" is a preposed
appositive. If systems scored differently on these types of appositives, we would have
more confidence that we were testing appositives. Neither type would necessarily be
easier because, although postposed appositives are more prototypical, preposed
appositives could be processed as modifiers. The systems did score differently on the
two types but did not score consistently higher on either type.

There was one part of the testing that did not follow the hypotheses put forth
before the testing. That test was constructed to compare results for "minimal pairs."
A set of messages was constructed to be exactly like the original test messages except
that in place of the appositives, simple sentences asserting the equivalence of the
appositioned noun phrases were introduced. The test was voluntary because it
required  an  additional  run  of  the  data  extraction  systems  as  well  as  scoring.  Two  sites 
volunteered  out  of  the  fifteen  participating.  It  was  hypothesized  that  their  scores 
would  be  higher  on  the  modified  messages  without  the  appositives.  However,  this  was 
not  the  case.  Instead,  the  "simple"  sentences  introduced  a  complexity  not  considered. 
The  use  of  the  copula  in  the  sentences  and  the  reference  resolution  required 
interfered  with  the  strategies  being  used  by  the  systems  to  obtain  the  slot  fills. 
Essentially,  both  systems  ignored  much  of  the  information  in  the  added  sentences. 
The  appositives  were  more  direct  conveyors  of  this  information.  However,  this  result 
also  supports  the  ability  to  isolate  the  phenomenon  because  there  was  a  difference 
when  the  appositives  in  the  messages  were  taken  out.  The  results  of  the  entire 
experiment  indicate  that  phenomena  can  be  isolated  and  that  linguistic  phenomena 
tests  are  valid  when  carefully  designed. 

A more extensive report on the linguistic phenomena test experiment entitled
"MUC-3 Linguistic Phenomena Test Experiment" appears in the proceedings of the
Third Message Understanding Conference (MUC-3) published by DARPA.

BRIEF  DESCRIPTION  OF  RESEARCH  ACTIVITIES: 

My  research  has  recently  focused  on  designing  the  metrics  and  phenomena 
tests for MUC-3 and facilitating the implementation of the semi-automated scoring
system used in the official scoring of participating systems. The MUC-3 Program
Committee and the participants have engaged in an interchange of ideas about these
topics.  My  role  has  been  to  be  a  part  of  this  interchange  and  resolve  the  issues 
necessary  to  finalize  the  testing.  In  addition  to  MUC-3,  my  other  research  activities 
include  algorithm  development  and  implementation  for  a  variety  of  applications  of 
artificial  intelligence. 

BBN Systems and Technologies
10 Moulton Street
Cambridge, MA 02138

Abstract

Work in speech recognition (SR) has a history of evaluation methodologies that permit
comparison among various systems, but until recently no methodology existed for either
developers of natural language (NL) interfaces or researchers in speech understanding (SU) to
evaluate and compare the systems they developed.

Recently considerable progress has been made by a number of groups involved in the DARPA
Spoken Language Systems (SLS) program to agree on a methodology for comparative evaluation
of SLS systems, and that methodology has been used in practice several times. This evaluation
is probably the only NL evaluation other than MUC to have been developed and used by a group
of researchers at different sites, although an excellent workshop was held to study some of
these problems [Palmer, 1988].

This paper gives an overview of the process that was followed in creating a meaningful
evaluation mechanism, describes the current mechanism, and presents some problems and
directions for future development. The development corpora and all material related to the
evaluation will be publicly available from NIST.


1.  A  Brief  History 

The goal of the DARPA Spoken Language Systems program is to further research and
demonstrate the potential utility of speech understanding. Currently, four major sites (BBN,
CMU, MIT, and SRI) are developing complete SLS systems, and another site (UNISYS) is
integrating its NL component with MIT's speech system. Representatives from these and other
organizations meet regularly to discuss program goals and to evaluate progress.

This DARPA SLS community formed a committee on evaluation¹, chaired by Dave Pallett of the
National Institute of Standards and Technology. The committee was to develop a methodology
for data collection, training data dissemination, and testing for SLS systems under
development.

The first community-wide evaluation using the first version of methodology developed by this
committee took place in June, 1990, and the second in February, 1991. They are reported in
[2] and [3] respectively. Additional information about the methodology can be found in [Boisen,
1989] and [Ramshaw, 1990].

The emphasis of the committee's work has been on automatic evaluation of queries to an air
travel information system (ATIS). Why ATIS? Because a database-oriented task seemed
most tractable, and air travel is an application that is easy for everyone to understand. Why
"automatic" evaluation? Because it was the most objective, least expensive practical solution
we could come up with.

¹ The primary members of the committee are: Lyn Bates (BBN), Debbie Dahl (UNISYS), Bill Fisher
(NIST), Lynette Hirschman (MIT), Bob Moore (SRI), and Rich Stern (CMU). Many other people
contributed to the work of the committee and its subcommittees.

2.  Some  Issues 

Systems for NL understanding or speech understanding are inherently much more difficult to
evaluate than SR systems. This is because the output of speech recognition is easy to specify
(it is a character string containing the words that were spoken as input to the system) and it
is trivially easy to determine the "right" answer and to compare it to the output of a particular
SR system. Each of these steps,

1. specifying the form that output should take,

2. determining the right output for particular input, and

3. comparing the right answer to the output of a particular system,

is very problematic for NL and SU systems.


3.  The  Goal 

The goal of the work was to produce a well-defined, meaningful evaluation methodology
(implemented using an automatic evaluation system) which will both permit meaningful
comparisons between different systems and also allow us to track the improvement in a single
NL or SLS system over time. The systems are assumed to be front ends to an interactive
application (database inquiry) in a particular domain (ATIS).

The intent is to evaluate specifically NL understanding capabilities, not other aspects of a
system, such as the user interface, or the utility (or speed) of performing a particular task
with a system that includes a NL component.


4. The Evaluation Framework

The methodology that was developed is very similar in style to that which has been used for
speech recognition systems for several years. It is:

1. Collect a set of data as large as feasible, under conditions as realistic as possible.

2. Reserve some of that corpus as a test set, and distribute the rest as a training set.

3. Develop agreement on meanings and answers for the items in the test set, and an
automatic comparison program to compare those "right" answers with the answers
produced by various systems.

4. Send the test set to the sites, where they will be processed unseen and without
modifications to the system. The answers are then returned and run through the evaluation
procedure, and the results reported.

Figure 1 illustrates the relationship between an SLS system and the evaluation system.

Figure 1: The Evaluation Framework


4.1  Collecting  Data 

A method of data collection called "Wizard scenarios" was used to collect raw data (speech and
transcribed text). This system is described in [Hemphill, 1990]. It resulted in the collection of
a number of human-machine dialogues.

It became clear that the language obtained in Wizard scenarios is very strongly influenced by
the particular task, the domain and database being used, and the amount and form of data
returned to the user.


4.2  Classifying  Data 

One of the first things to become clear was that not all of the collected data was suitable as
test data, because not all of the queries posed by the subjects could be answered by the wizard,
and some of those that were answered were clearly beyond any reasonable goal for this
generation of NL systems. Thus it was desirable that the training data be marked to indicate
which queries one might reasonably expect to find in the test set.

The notion emerged of having a number of classes of data, so that we could begin with a core
(Class A) which was clearly definable and possible to evaluate automatically, and, as we came
to understand the evaluation process better, which could be extended to other types of queries
(Classes B, C, D, etc.).

Several possible classification systems were presented and discussed in great detail. Two
have been agreed on at this time: Class A, which consists basically of independent utterances
with agreed upon meanings, and Class D1, which consists of very short dialogue fragments.




4.3  Agreeing  on  Meaning 


Agreeing on the meaning of queries has been one of the hardest tasks for the committee. The
issues are often subtle, and interact with the structure and content of the database in
sometimes unexpected ways.

As an example of the problem, consider a request to "List the direct flights from Boston to
Dallas that serve meals". It seems straightforward, but should this include flights that might
stop in Chicago without making a connection there? Should it include flights that serve a snack,
since a snack is not considered by some people to be a full meal?

Without some common agreement, many systems would produce very different answers for the
same questions, all of them equally right according to the systems' own definitions of the
terms, but not amenable to automatic evaluation. It was necessary to agree on the meaning of
terms such as "mid-day", "meals", "the fare of a flight", and several dozen other things, but
this agreement was achieved and is documented.


4.4  Developing  Reference  Answers 

It is not enough to agree on the meaning of queries in the chosen domain. It is also necessary to
develop a common understanding of what is to be produced as the answer, or part of the
answer, to a question.

For example, if a user asks "What is the departure time of the earliest flight from San
Francisco to Atlanta?", one system might reply with a single time and another might reply with
that time plus additional columns containing the carrier and flight number; a third system might
also include the arrival time and the origin and destination airports. None of these answers
could be said to be wrong, although one might argue about the advantages and disadvantages of
terseness and verbosity.

It was agreed that, for the sake of automatic evaluation, a canonical reference answer (the
minimum "right" answer) should be developed for each evaluable query in the training set, and
that the reference answer should be that answer retrieved by a reference SQL expression.
That is, the right answer was defined by the expression which produces the answer from the
database, as well as the answer retrieved. This ensures A) that it is possible to retrieve the
canonical answer via SQL, B) that even if the answer is empty or otherwise limited in content,
it is possible for system developers to understand what was expected by looking at the SQL,
and C) that the reference answer contains the least amount of information needed to determine
that the system produced the right answer.

What should be produced for an answer is determined both by domain-independent linguistic
principles [Boisen, 1989] and domain-specific stipulation (Appendix A). The language used to
express the answers is defined in Appendix B.

4.5 Developing a Comparator

A final necessary component is, of course, a program to compare the reference answers to
those produced by various systems. One was written in C by NIST; anyone interested in
obtaining the code for these comparators should contact Bill Fisher at NIST.

The task of answer comparison is complicated substantially by the fact that the canonical
answer is intentionally minimal, but the answer supplied by a system may contain extra
information. Some intelligence is needed to determine when two answers match (i.e., simple
identity tests won't work).
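
To make the matching problem concrete, here is a minimal Python sketch of one way a comparator could tolerate extra columns. It is not the NIST comparator, whose rules are not given here. It assumes answers are tables of equal-width rows, that row order and duplicate rows do not matter, and that a system answer matches when some choice of its columns reproduces the reference answer exactly; the function name is hypothetical.

```python
from itertools import permutations


def answers_match(reference, system):
    """Return True if some selection of columns from the system answer
    reproduces the (minimal) reference answer."""
    ref_rows = {tuple(row) for row in reference}
    if not ref_rows:
        return len(system) == 0      # empty reference matches only an empty answer
    if not system:
        return False
    ref_width = len(next(iter(ref_rows)))
    sys_width = len(system[0])
    if sys_width < ref_width:
        return False                 # the system left out required information
    # Try every ordered choice of ref_width columns from the system answer.
    for cols in permutations(range(sys_width), ref_width):
        projected = {tuple(row[c] for c in cols) for row in system}
        if projected == ref_rows:
            return True
    return False


# Hypothetical data: the reference answer is a single departure time.
reference = [("0615",)]
print(answers_match(reference, [("0615", "DL", "123")]))   # extra columns tolerated
print(answers_match(reference, [("0700", "DL", "123")]))   # wrong time
```

A brute-force search over column choices is adequate for the narrow tables in this kind of task, but it also shows why a system that returns many columns has more chances of matching by accident.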


4.6  Presenting  Results 

Expressing results can be almost as complicated as obtaining them. Originally it was thought
that a simple "X percent correct" measure would be sufficient; however, it became clear that
there was a significant difference between giving a wrong answer and giving no answer at all,
so the results are now presented as: Number right, Number wrong, Number not answered,
Weighted error percentage (weighted so that wrong answers are twice as bad as no answer at
all), and Score (100 - weighted error).
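
The reported figures can be sketched as follows. The exact weighting is an assumption consistent with the description: each wrong answer counts twice and each unanswered query counts once, expressed as a percentage of all test queries. The function name is hypothetical.

```python
def summarize(right: int, wrong: int, unanswered: int):
    """Return (weighted_error_percent, score) for one test run."""
    total = right + wrong + unanswered
    if total == 0:
        raise ValueError("no test queries")
    # Assumption: wrong answers weigh 2, unanswered queries weigh 1.
    weighted_error = 100.0 * (2 * wrong + unanswered) / total
    score = 100.0 - weighted_error
    return weighted_error, score


# Hypothetical run: 70 right, 10 wrong, 20 not answered.
print(summarize(right=70, wrong=10, unanswered=20))
```

Under this weighting the score equals the percentage right minus the percentage wrong, so a system gains by declining to answer whenever it is more likely to be wrong than right.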

5.  Strengths  of  the  Methodology 

It forces advance agreement on the meaning of critical terms and on at least minimal
information to be included in the answer.

It is objective, to the extent that a method for selecting testable queries can be defined, and to
the extent that the agreements mentioned above can be reached.

It requires less human effort (primarily in the creating of canonical examples and answers)
than non-automatic, more subjective evaluation; it is thus better suited to large test sets.

It can be easily extended.


6.  Weaknesses  of  the  Methodology 

It does not distinguish between merely acceptable answers and very good answers.

It does not distinguish between some cases, and may thus give undue credit to a system that
"over answers".

It cannot tell if a system gets the right answer for the wrong reason.

It does not adequately measure the handling of some phenomena, such as extended dialogues.


7.  Future  Issues 

The hottest topic currently facing the SLS community with respect to evaluation is what to do
about dialogues. Many of the natural tasks one might do with a database interface involve
extended problem-solving dialogues, but no methodology exists for evaluating the capabilities
of systems attempting to engage in dialogues with users. Several suggestions have been made
([Hirschman, 1990] and [Bates, 1991]) and will be discussed in depth.

Appendix A: Interpreting ATIS Queries Relative to the
Database


Basics:

A large class of tables in the database have entries that can be taken as defining things that can
be asked for in a query. In the answer, each of these things will be identified by giving a value
of the primary key of its table. These tables are:


Name of Table     English Term(s)                      Primary Key

aircraft          aircraft, equipment                  aircraft_code
airline           airline                              airline_code
airport           airport                              airport_code
city              city                                 city_code
compound_class    service classes, e.g. "coach", etc.  fare_class
day               names of the days of the week        day_code
fare              fare                                 fare_code
flight            flight                               flight_code
food_service      meals                                meal_code, meal_number, meal_class
ground_service    ground transportation                city_code, airport_code, transport_code
month             months                               month_number
restriction       restrictions                         restrict_code
state             names of states                      state_code
time_zone         time zones                           time_zone_code, time_zone_name
transport         transport code                       transport_code


Special  meanings: 

In this arena, certain English expressions have special meanings, particularly in terms of the
database distributed by TI in the spring of 1990. Here are the ones we have agreed on. (In the
following, "A.B" refers to field B of table A.)


1. Flights.

A flight "between X and Y" means a flight "from X to Y".

In an expression of the form "flight number N", where N is a number, N will always be
interpreted as referring to the flight number (flight.flight_number). "Flight code N" will
unambiguously refer to flight.flight_code. "Flight N" will refer to flight.flight_number if N is in
the range 0 <= N <= 9999 but to flight.flight_code if N >= 100000.

A "one-way" flight is a flight with a fare whose one-way cost is non-empty.

Principle: If an attribute "X" of a fare, such as "one-way" or "coach", is used as a modifier of
a flight, it will be interpreted as "a flight with an X fare".
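
The rule for a bare "Flight N" can be written as a small Python function. The sketch reads the stated ranges as 0 <= N <= 9999 and N >= 100000; the function name is hypothetical, and returning None for values outside both ranges is an assumption, since no interpretation is stipulated for them.

```python
from typing import Optional


def resolve_flight_reference(n: int) -> Optional[str]:
    """Map the N in a bare "Flight N" to the database field it refers to."""
    if 0 <= n <= 9999:
        return "flight.flight_number"
    if n >= 100000:
        return "flight.flight_code"
    # Values from 10000 to 99999 (and negative values) are not covered
    # by the stipulation, so this sketch leaves them unresolved.
    return None


print(resolve_flight_reference(747))
print(resolve_flight_reference(123456))
print(resolve_flight_reference(50000))
```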

2. Fare (classes).

A "one-way" fare is one with a non-empty one-way cost.

References to the "cheapest fare" are ambiguous, and may be interpreted as meaning either the
cheapest one-way fare, the cheapest round-trip fare, the cheapest of either, or the cheapest
of both.

A "coach" fare is one whose compound_class.class_type = "COACH". Similarly, the fare
modifiers "first class", "business class", and "thrift class" refer to values of the
compound_class.class_type field.

A reference to ranking of fares, e.g. "fares that are Y class or better", will be interpreted as a
reference to the rank of the associated base fare (class_of_service.rank). A "discounted
fare" is one whose compound_class.discounted = "YES".

An "excursion fare" is one with a restriction code (fare.restrict_code) that contains the string
"AP", "EX", or "VU".

A "family fare" is the same thing as an "excursion fare".

A "special fare" is one with a non-null restriction code (fare.restrict_code).

3.  Time. 

The normal answer to otherwise unmodified "when" queries will be a time of day, not a date or
a duration.

The answer to queries like "On what days does flight X fly" will be a list of day.day_code
fields, not a flight_days string.

Queries that refer to a time earlier than 1300 hours without specifying "a.m." or "p.m." are
ambiguous and may be interpreted as either.

4.  Units. 

All units will be the same as those implicit in the database (e.g. feet for aircraft.wing_span, but
miles for aircraft.range_miles, durations in minutes).

5.  Meals. 

For purposes of determining flights "with meals/meal service", snacks will count as a meal.
"List the types of meal" should produce one tuple per meal, not a single meal_code string.

6. "With" clauses.

"With"-modification clauses: "show me all the flights from X to Y with their fares" will require
the identification of both flights and their fares (so if there are 2 flights, each with three
fares, the answer will have 6 tuples, each with at least the flight_code and fare_code). In
general, queries asking for information from two or more separate tables in the database will
require the logical union of fields that would identify each table entry separately.
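
The tuple count in the example can be checked with a short Python sketch. The flight and fare codes below are made up for illustration, and the function name is hypothetical.

```python
def with_clause_tuples(fares_by_flight):
    """Pair every flight with each of its fares: one answer tuple per pair,
    carrying the identifying key of both table entries."""
    return [
        (flight_code, fare_code)
        for flight_code, fare_codes in fares_by_flight.items()
        for fare_code in fare_codes
    ]


# Two hypothetical flights, each with three hypothetical fares.
example = {
    101: [7001, 7002, 7003],
    102: [7004, 7005, 7006],
}
rows = with_clause_tuples(example)
print(len(rows))
print(rows[0])
```

An answer that listed only the two flights, or only the six fares, would omit one of the identifying fields and so would not satisfy the stipulation.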

7. The "itinerary" of a flight refers to the set of all non-stop legs of that flight. When an
"itinerary" is asked for, each leg of the flight will be identified by the origin and destination
cities for that leg, e.g. (("BOS" "ATL") ("ATL" "DFW")).

8. "What kind of X is Y" queries, where Y is clearly a kind of X, will be interpreted as equivalent
to "what does Y mean?", where Y is a primary key value for the table referred to by X (see 10
below).

9. "Class".

References to classes of service will be taken as referring to the contents of the
compound_class table (not the class_of_service table).

Queries about (unmodified) "class X", e.g. "What is class X?", will be interpreted as referring
to the set of compound_class.fare_class entries for which "X" is the fare_class, not the
base_class, e.g. (("X")), not (("XA")("XB")).

The expression "(fare) classes for flight X" refers ambiguously to either the mostly
1-character class codes that are stored in flight_class.fare_class (sometimes called "booking
codes"), or to the largely multi-character class codes that are stored in fare.fare_class as
attributes of fares that are associated with flight X via the flight_fare table (sometimes called
"fare bases"). In either case, the answer should have the class codes in separate fields, not
packed together as they are in flight.class_string.

10. Requests for the 'meaning' of something will only be interpretable if that thing is a code
with a canned definition in the database. Here are the things so defined, with the fields
containing their decoding:


Table               Key Field          Decoding Field

aircraft            aircraft_code      aircraft_type
airline             airline_code       airline_name
airport             airport_code       airport_name
city                city_code          city_name
code_description    code               description
column_table        heading            column_description
day                 day_code           day_name
food_service        meal_code          meal_description
interval            period             begin_time, end_time
month               month_number       month_name
state               state_code         state_name
time_zone           time_zone_code     time_zone_name
transport           transport_code     transport_description

11. A request for a flight's stops will be interpreted as asking for the final stop in addition to
intermediate stops.

12. Queries that are literally yes-or-no questions may be answered by either a boolean value
('YES/TRUE/NO/FALSE') or a relation, expressed as a table. Reference (REF) answers to such
questions will be recorded as either a null or a non-null table. If the reference answer is a null
table, then either 'NO/FALSE' or a null table will count as correct; if the reference answer is a
non-null table, then either 'YES/TRUE' or a table that matches the REF table will be counted as
correct. In other words, if a table is given as an answer, it must match the REF table to be
counted correct.

13. A city and an airport will be considered 'near' (or 'nearby') each other iff the city is
served by the airport, and two cities will be considered 'near' (or 'nearby') each other iff
there is an airport that serves them both.

14. When it is clear that an airline is being referred to, the term 'American' by itself will be
taken as unambiguously referring to American Airlines.

15. Vague queries of the form 'Give me information about X' or 'Describe X' or 'What is X'
will be interpreted as equivalent to 'List X'.

16.  References  to  'the  city'  or  'downtown'  are  ambiguous,  and  may  be  interpreted  as 
referring  to  any  city  that  seems  reasonable. 

17. When a query refers to an aircraft type with a descriptive phrase, such as 'BOEING 767'
or 'TYPE 767 - BY BOEING', the reference is ambiguous: it may be taken to be the set of
entries in the 'aircraft' table whose 'aircraft_type' field (the second in the current table)
matches the descriptive phrase well, such as:

767, 'BOEING 767 (ALL SERIES)', ...

763, 'BOEING 767-300/300ER', ...

or it may be taken to mean just the entry with the matching 'aircraft_code' value (the first
field in the current table).

18. Utterances whose answers require arithmetic computation are not now considered to be
interpretable.

19. Cases like 'the prices of flights, first class, from X to Y', in which the attachment of a
modifier that can apply to either prices or flights is unclear, should be (ambiguously)
interpreted both ways, as both 'the first-class prices on flights from X to Y' and 'the prices
on first-class flights from X to Y'. More generally, if structural ambiguities like this could
result in different (SQL) interpretations, they must be treated as ambiguous.
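
The yes/no rule in item 12 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the official scoring software: the function name yes_no_correct is hypothetical, tables are modelled as lists of tuples, and "matches the REF table" is reduced to equality of the two sets of tuples.

```python
YES_WORDS = {"YES", "TRUE"}
NO_WORDS = {"NO", "FALSE"}


def yes_no_correct(hyp, ref_table):
    """Score a HYP answer to a yes/no query against the REF table.

    hyp is either a boolean word (a string) or a table (list of tuples);
    ref_table is a list of tuples, empty for a null table.
    """
    if isinstance(hyp, str):
        word = hyp.strip().upper()
        # A non-null REF table stands for "yes", a null REF table for "no".
        return word in (YES_WORDS if ref_table else NO_WORDS)
    # A table answer must match the REF table (simplified to set equality);
    # a null HYP table therefore matches only a null REF table.
    return set(hyp) == set(ref_table)
```

For example, yes_no_correct("yes", [("AA101",)]) is accepted, while a table answer that differs from a non-null REF table is rejected even though the query is literally a yes/no question.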

Appendix  B:  Common  Answer  Specification  (CAS)  Syntax 

BASIC SYNTAX IN BNF:

<answer>         ::=  <cas1> | (<cas1> OR <answer>)
<cas1>           ::=  <scalar-value> | <relation> | NO_ANSWER | no_answer
<scalar-value>   ::=  <boolean-value> | <number-value> | <string>
<boolean-value>  ::=  YES | yes | TRUE | true | NO | no | FALSE | false
<number-value>   ::=  <integer> | <real-number>
<integer>        ::=  [<sign>] <digit>+
<sign>           ::=  + | -
<digit>          ::=  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<real-number>    ::=  <sign> <digit>+ . <digit>* | <digit>+ . <digit>*
<string>         ::=  <char_except_whitespace>+ | "<char>*"
<relation>       ::=  (<tuple>*)
<tuple>          ::=  (<value>+)
<value>          ::=  <scalar-value> | NIL

Standard BNF notation has been extended to include two other common devices: '<A>+' means
'one or more A's' and '<A>*' means 'zero or more A's'.

The above formulation does not define <char_except_whitespace> and <char>. All of the
standard ASCII characters count as members of <char>, and all but 'white space' count as
<char_except_whitespace>. Following ANSI 'C', blanks, horizontal and vertical tabs, newlines,
formfeeds, and comments are, collectively, 'white space'.

The only change in the syntax of CAS itself from the previous version is that now a string may
be represented as either a sequence of characters not containing white space or as a sequence
of any characters enclosed in quotation marks. Note that only non-exponential real numbers
are allowed, and that empty tuples are not allowed (but empty relations are).

ADDITIONAL  SYNTACTIC  CONSTRAINTS 

The syntactic classes <boolean-value>, <string>, and <number-value> define the types
'boolean', 'string', and 'number', respectively. All the tuples in a relation must have the
same number of values, and those values must be of the same respective types (boolean,
string, or number).

If a token could represent either a string or a number, it will be taken to be a number; if it
could represent either a string or a boolean, it will be taken to be a boolean. Interpretation as
a string may be forced by enclosing a token in quotation marks.

In a tuple, NIL as the representation of missing data is allowed as a special case for any value,
so a legal answer indicating the costs of ground transportation in Boston would be

(('L'  5.00  )  ('R'  nil  )  ('A'  nil  )  ('R'  nil  )) 
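
The typing rules above (number before string, boolean before string, quotation marks forcing a string, NIL as a special value) can be illustrated with a small token classifier. This is a hedged sketch rather than the official CAS reader: classify_token is a hypothetical name, and accepting single as well as double quotation marks is an assumption made because the example above uses single quotes.

```python
import re

BOOLEAN_WORDS = {"YES", "TRUE", "NO", "FALSE"}
INTEGER = re.compile(r"[+-]?\d+")
REAL = re.compile(r"[+-]?\d+\.\d*")  # non-exponential reals only


def classify_token(token):
    """Return 'string', 'number', 'boolean' or 'nil' for one CAS token."""
    # Quotation marks force the string interpretation.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "\"'":
        return "string"
    # Distinguished values may be written in upper or lower case.
    if token.upper() == "NIL":
        return "nil"
    if token.upper() in BOOLEAN_WORDS:
        return "boolean"
    # Anything that can be read as a number is taken to be a number.
    if INTEGER.fullmatch(token) or REAL.fullmatch(token):
        return "number"
    return "string"
```

Under these assumptions the token 5.00 is a number, yes is a boolean, and "yes" in quotation marks is a string; a token such as 1e5 falls through to string because exponential notation is not allowed for reals.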

ELEMENTARY  RULES  FOR  CAS  COMPARISONS 

String comparison is case-sensitive, but the distinguished values (YES, NO, TRUE, FALSE,
NO_ANSWER, and NIL) may be written in either upper or lower case.

Each indexical position for a value in a tuple (say, the ith) is assumed to represent the same
field or variable in all the tuples in a given relation.

Answer relations must be derived from the existing relations in the database, either by
subsetting and combining relations or by operations like averaging, summation, etc.

In matching an hypothesized (HYP) CAS form with a reference (REF) one, the order of values in
the tuples is not important; nor is the order of tuples in a relation, nor the order of
alternatives in a CAS form using 'OR'. The scoring algorithm will use the re-ordering that
maximizes the indicated score. Extra values in a tuple are not counted as errors, but distinct
extra tuples in a relation are. A tuple is not distinct if its values for the fields specified by the
REF CAS are the same as another tuple in the relation; these duplicate tuples are ignored.

CAS forms that include alternate CAS's connected with 'OR' are intended to allow a single HYP
form to match any one of several REF CAS forms. If the HYP CAS form contains alternates, the
score is undefined.
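
The relation-matching rules above can be sketched as a brute-force search over column choices. This is an illustrative simplification, not the scoring program itself: relation_matches is a hypothetical name, relations are lists of equal-width tuples, scalar values are compared with plain equality (so numeric tolerance and white-space trimming are left out), and 'OR' alternatives are not handled.

```python
from itertools import permutations


def relation_matches(hyp, ref):
    """Return True if the HYP relation matches the REF relation.

    Order of tuples and of values is ignored, extra HYP columns are
    tolerated, duplicate tuples are ignored, extra distinct tuples fail.
    """
    if not ref:
        # An empty REF relation is matched only by an empty HYP relation.
        return not hyp
    width = len(ref[0])
    if not hyp or len(hyp[0]) < width:
        return False
    ref_set = set(ref)
    # Try every ordered choice of `width` HYP columns as the REF fields.
    for cols in permutations(range(len(hyp[0])), width):
        # Projecting into a set drops tuples that are not distinct
        # on the fields specified by REF.
        projected = {tuple(row[c] for c in cols) for row in hyp}
        if projected == ref_set:
            return True
    return False
```

The search is exponential in the number of columns, which is acceptable for a sketch over small answer tables; a production scorer would presumably prune column choices by type first.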

In comparing two real number values, a tolerance will be allowed; the default is plus or minus
.01%. No tolerance is allowed in the comparison of integers. In comparing two strings, initial
and final sub-strings of white space are ignored. In comparing boolean values, 'TRUE' and
'YES' are equivalent, as are 'FALSE' and 'NO'.

1 Introduction


Currently there are two major multi-site efforts in progress which share the goal of providing
quantitative evaluations of natural language processing. The Air Travel Information System (ATIS)
([4]) has been developed to evaluate speech, natural language and spoken language processing. The
various template generation (data extraction) tasks that have formed the basis for a series of
Message Understanding Conferences (MUCs) ([5]) have been designed to evaluate text understanding
systems. The MUC and ATIS efforts have both been designed to compare the performance of multiple
systems on a common blackbox task. In both efforts the task of developing evaluation procedures has
been extremely valuable, providing new insights into technical issues and revealing unanticipated
merits and pitfalls in the evaluation enterprise. It is safe to say that developing evaluation
procedures has required more effort than originally had been anticipated by MUC and ATIS
participants, and that many unexpected issues have arisen.

Although the specifics of the MUC and ATIS efforts differ, both have required their participants to
deal with similar issues in the areas of task and procedure definition, data specification, and scoring.
In this paper we describe similarities and differences in the solutions that have been arrived at for
such issues in the two efforts. We also describe the benefits which have come from these efforts, and
point out aspects of natural language processing performance which are not yet being measured. Our
goal at one level is to document the issues that arise in defining large-scale, multi-site evaluations.
At a more global level we hope to provide insights into general issues of evaluation based on the
experience that we have gained by participating in the MUC and ATIS tasks.

*This work was supported by DARPA contract N000014-J9-C0171 to Unisys Corporation, administered by the
Office of Naval Research, by DARPA contract MDA-MJ-IS-C-OMl to Unisys Corporation, administered by the
Meridian Corporation, by Independent Research and Development funding from Unisys Defense Systems, and
by DARPA contract N00014-W-C-0220 to SRI. We thank Dave Pallett of the National Institute of Standards
and Technology for his helpful comments on a draft of this paper.

2 Goals of Evaluation


In a blackbox evaluation a dual perspective on the goals of the evaluation can arise. Specifically,
is the intent of the evaluation to use performance on a task as a tool to evaluate natural language
processing, or is the intent of the evaluation to measure how well the task can be performed using any
techniques at all?

Once a specific task is defined, the question naturally arises of how well that task could be performed
using techniques other than natural language processing. No task that we can use for evaluation is
likely to absolutely require natural language processing; there is always the possibility that some
other techniques will do at least part of the job. If techniques other than natural language processing
can be used to perform the task, then the evaluation can be seen as an evaluation of natural language
processing as a tool for solving a particular problem versus other possible tools. This is quite a
different goal than that of evaluating natural language processing. It can be a valuable goal, but it
is important to be clear what the goal is.

The goals of the evaluation also determine what it means if other techniques can solve the problem
well. (This assumes that we can decide what is and is not a "natural language processing technique",
which can be controversial.) If the goal is to evaluate natural language processing, then the fact
that other techniques work well might mean that the task is not a good one for evaluating natural
language processing. On the other hand, if the goal is to perform the task well, the fact that other
techniques work well might mean that natural language is not a good tool for this particular problem.


3  Tasks 


The ATIS and MUC efforts both involve blackbox evaluation measures ([3]). For multi-site system
comparisons, this approach to evaluation is currently the only practical choice, since it permits
measurements of performance for systems with very different architectures by making it easier
(although by no means trivial) to obtain the consensus of multiple sites on performance metrics, and by
simplifying the development of automatic scoring software.


3.1 The ATIS Task

The task used in the ATIS evaluation is database queries to a relational database of information on
air travel planning, for example information about flight schedules, fares and ground transportation
in ten cities. A typical query is shown in Figure 1.

A standard database is supplied to all sites participating in the evaluation. Scoring is based on
the answer returned from the database. The data has been collected using a Wizard of Oz paradigm,
although some data being collected currently has much of the speech and language processing
performed by systems, rather than wizards. The data is used for evaluation of speech recognition as
well as spoken and written language understanding, although our focus in this paper is on language
understanding.

User: list all the flights baltimore and atlanta on
tuesdays between four in the afternoon and nine in the evening .

Very well. I will ask the expert to do that.

Figure 1: Example ATIS query and Unisys system response

3.2 The MUC Task

The MUC evaluation effort began in 1987 when DARPA sponsored a message understanding conference
directed by Beth Sundheim at NOSC in San Diego. Participating research groups at this
first conference were required to report on the performance of their text understanding software in
processing military message traffic of a certain type. The conference was successful in bringing
research groups together to work on a common domain, but it was clear that a common application
and scoring methodology were needed before any such evaluation effort could produce consistent,
cross-system performance measures.

DARPA consequently sponsored a second message understanding conference (MUCK-2) in June,
1989. For this conference, a blackbox template (database record) generation task was defined. A
message domain similar to the one used in the first conference was used, and guidelines for scoring
template generation performance were established. The results reported by participating research
groups at this second conference provided a concrete view of the capabilities of current text
understanding technology.

A third message understanding conference, MUC-3, took place in June, 1991. For this conference,
the same type of template generation task was used that was introduced in the second conference.
The message domain, however, was changed to newspaper articles and transcribed radio broadcasts
and speeches. A portion of a typical MUC-3 message and its corresponding template fill are shown
in Figure 2.

BOGOTA, 7 JUL 89 (EFE) -- [TEXT] COLOMBIAN OFFICIALS REPORT THAT GUERRILLAS PRESUMED
TO BE MEMBERS OF THE PRO-CASTRO ARMY OF NATIONAL LIBERATION (ELN) TODAY ONCE AGAIN
DYNAMITED THE CANO LIMON-COVENAS OIL PIPELINE, COLOMBIA'S MAJOR OIL PIPELINE. AN
ECOPETROL [COLOMBIAN PETROLEUM ENTERPRISE] SPOKESMAN SAID THAT THE EXPLOSION TOOK
PLACE AT KM-102 OF THE PIPE NEAR BENADIA IN ARAUCA INTENDANCY, IN THE EASTERN PART
OF THE COUNTRY.

Slot  Description                  Filler

0     message id                   TST2-MUC3-0099
1     template id                  1
2     date of incident             07 JUL 89
3     type of incident             BOMBING
4     category of incident         TERRORIST ACT
5     perpetrator: id of indiv(s)  "GUERRILLAS"
6     perpetrator: id of org(s)    "PRO-CASTRO ARMY OF NATIONAL LIBERATION"
7     perpetrator: confidence      SUSPECTED OR ACCUSED BY AUTHORITIES: "PRO-CASTRO
                                   ARMY OF NATIONAL LIBERATION"
8     physical target: id(s)       "OIL PIPELINE"
9     physical target: total num   1
10    physical target: type(s)     ENERGY: "OIL PIPELINE"
11    human target: id(s)          -
12    human target: total num      -
13    human target: type(s)        -
14    target: foreign nation(s)    -
15    instrument: type(s)          *
16    location of incident         COLOMBIA: ARAUCA (INTENDANCY): BENADIA (TOWN)
17    effect on physical target    SOME DAMAGE: "OIL PIPELINE"
18    effect on human target(s)    -

Figure 2: A MUC-3 message fragment and its associated template fill.

For MUC-3, DARPA funded the development of a scoring program so that the evaluation of template
generation performance could be automated to the maximum extent possible, thereby reducing the
inconsistencies of human scorers. The number of messages that text understanding systems were
required to process in the MUC-3 task increased by an order of magnitude over the number processed
in the second conference. DARPA has already announced a fourth message understanding conference,
MUC-4, to be held in June, 1992. For this conference, the same domain, task, and scoring procedures
will be used that were used for the MUC-3 cycle.


3.3 Task Simplification

In both the MUC and ATIS efforts the tasks to be accomplished were simplified versions of real world
applications. Simplification is necessary because handling a completely realistic task would require
building a great deal of hardware and software infrastructure that is really peripheral to the goal of
evaluation. On the other hand, oversimplification must be avoided or the behavior evaluated will not
scale up to realistic applications.

In ATIS, the task was simplified in several respects. First, the database used was a subset of the
actual air travel planning database and did not include information about all cities served. Specifically,
only 11 US cities were included in the database. The fact that some information is missing may have
unknown effects on the types of queries collected. Second, the wizard was uncooperative in the sense
that if the subject made a query that was outside the domain, the wizard would only issue an error
message and would not attempt to help the subject obtain the information. In addition the wizard
would never take the initiative and help the subject out if he or she seemed to need direction. This
simplification makes the queries easier to evaluate but at the cost of creating a task that does not
fully represent the complexities of human-human communication.

The MUC-3 task was simplified in that the systems being evaluated were not expected to deal with
an active stream of incoming messages. On the other hand, the corpus processed was far more realistic
in magnitude than in previous MUC evaluations, and the task of template filling was acknowledged
by government observers to be representative of true applications.


4  Procedures 


MUC and ATIS are large-scale, cyclic evaluation programs involving a number of independent research
groups. For such large-scale efforts it is necessary to have a more or less formal process for making
decisions and administering the evaluation.


4.1 Development of the Applications and Domains

The MUC program directed by Beth Sundheim at NOSC under the sponsorship of DARPA is now
planning its fourth cycle of evaluation. Evaluations have taken place in May of 1987, 1989, and 1991.
These cycles vary somewhat in application, domain, and complexity. But these variations are not
methodical; they have resulted from an evolving view of what an evaluation cycle for text understanding
systems ought to be. The ATIS program, administered by NIST through an interagency agreement
with DARPA, is a more recently established evaluation effort than MUC.¹ Evaluations have taken
place in June of 1990 and February of 1991, and the third evaluation is planned for February of 1992.
These evaluations have exhibited less variation than those of MUC. Each cycle has focussed on the
same application and domain, and the same basic scoring procedure has been used.

The tendency to use the same application and domain that is evolving in the MUC effort and that
has been present from the outset of the ATIS effort has the advantage of allowing more informative
intercycle performance measures. However, this advantage has a price. When the application
and domain do not vary between cycles, it is difficult to evaluate portability, at least in the sense
of porting speed.


4.2  Coordination 

Although the MUC and ATIS programs share the tendency to use the same application and domain
across evaluation cycles, they differ in the nature of the roles played by the program coordinators
and by the participants in the evaluation. In MUC, Beth Sundheim has been the dominant force in
shaping the direction taken in the evaluation effort. Although she has created committees to help
her make decisions, she has been the driving force in setting the rules for participation, defining the
evaluation tasks, locating data, creating scoring guidelines, and so forth. In the ATIS effort, on the
other hand, although NIST coordinates the overall evaluation, there are also several working groups
which make recommendations on data collection, evaluation and the relational database.

¹Although benchmark testing for speech evaluation began in March of 1987, evaluation of spoken language
understanding (using the ATIS domain) began in 1990.


4.3  Documentation  and  Scoring  Software 

In both the MUC and ATIS evaluation efforts automatic scoring software has been used. Although
this has probably been an important component in keeping the evaluations objective and relatively
inexpensive, it introduces another factor into the evaluation process that is peripheral to the research.
The selection of scoring software may also affect the amount of time any given research group has to
prepare for an evaluation: research groups that differ in their familiarity with the scoring software
will differ in the amount of time they spend reading scoring documentation and learning how to use
scoring software. This situation arose in the MUC effort when a scoring program built on top of GNU
emacs was adopted. Research groups that were not familiar with this text editor spent significantly
more time installing the software and learning how to use it.

In defining the evaluation procedures in both the MUC and ATIS efforts, many decisions about
various details were made. It is very important to document these decisions in order to allow new
groups to participate in the evaluation. Familiarity with scoring techniques and familiarity with the
current application and domain of a large-scale effort will have a significant effect on the entry cost
for new research groups that have decided to participate in an evaluation program. The entry cost
will vary from program to program, and the desire to minimize the cost may vary as well.


5  Data 


One of the primary ways in which the evaluation trend has benefited the progress of research in
natural language processing systems is by pointing out the clear necessity for developing
methods to deal with naturally occurring text, with complicated structure, ungrammaticalities, and
long sentences. Whether one is developing a "hand-crafted" system, or applying statistical methods
to adapt a system to the types of texts typical of a certain domain, it is necessary to have a great
deal of data to provide exemplars of the phenomena that arise and to provide a statistically adequate
sample of texts in the domain.

For the MUC evaluation, obtaining data is a relatively simple task, since newspaper articles on any
topic are readily obtainable in large quantities from any wire service. Although naturally occurring
raw data is abundant and easy to obtain, this data is not necessarily useable for system development
without considerable processing. First, accurate answer templates for all of the texts must be
obtained. If systems wish to use the collected data for training statistical models, it is also desirable
to have lexical category tags and structural bracketing accurately assigned, a very time-consuming
task.

The problems of gathering data for the ATIS corpus were somewhat different. Because there are
no existing systems that perform as the envisioned ATIS systems do, it is necessary to set up artificial
"Wizard of Oz" scenarios in which a subject interacts with a person pretending to be a computer.
This technique can yield much data; however, the exact characteristics of the data are very sensitive
to the precise protocols under which the data is collected. Very small differences in the mental state of
the subject (e.g. whether or not the subject is aware that the experiment is a simulation, or believes
he or she is actually talking to a machine) have a significant effect on the linguistic phenomena that
are evident in the data. The same observation holds true for the task that is assigned to the user in
the data collection experiment. Because the experiment requires presenting an answer interactively
to the user, the correct ATIS answers are collected as a byproduct of running the scenario.

The ATIS data was originally collected by several sites under contract to NIST, which compiled
the data and distributed it to the participating system developers. However, sites working on ATIS
are now moving to a multi-site data collection paradigm, where data is collected by the participating
sites themselves and contributed to a common pool. This common data collection paradigm serves
both to diversify the training data that sites will see and to reduce the possibility of gaps in the
data flow because of problems at one or two sites. However, data collected at multiple sites requires
additional efforts in standardization and coordination to make sure that the data are consistent.

The raw MUC data (the original texts) was collected by NOSC, but the generation of key templates
was done by the participants themselves, which proved to be time-consuming and distracting for all
involved. Part of speech tagging and bracketing of a subset of the data was undertaken by the
TREEBANK project ([1]).

Because of the labor required to run the data collection experiments, the ATIS evaluation is
relatively data poor, and many sites have supplemented the official data with data collected on their
own. MUC participants, on the other hand, found themselves with too much data in one sense
and not enough in another. Because raw data is easily obtainable in large quantities, there was no
shortage. However, some sites found that a shortage of processed data with accurate answer keys,
part of speech tags and/or bracketing hindered their development efforts.

One interesting difference between the MUC and ATIS data collection efforts was that because the
MUC data was provided all at once, many economies of scale in processing the data were possible
which weren't cost effective in handling the ATIS data, which was provided in smaller increments.
For example, in MUC it was cost effective to use batch lexical entry tools for entering hundreds of new
words at a time, while in ATIS, entering a few new words at a time as the data arrived didn't justify
the start-up overhead of using batch tools. In addition, with a large amount of data it is possible
to more reliably prioritize development efforts on the basis of frequency of occurrence of linguistic
phenomena. A large amount of data is helpful in determining to what extent observed phenomena
are sporadic or are really representative of the domain.


6  Scoring 


The intended application of a system determines what evaluation metrics are most appropriate. Since
the ATIS application (interactive database information retrieval) and the MUC application (off-line
text information extraction) are considerably different, it is not surprising that the evaluation metrics
chosen for them are quite different.

Since the intended ATIS application is interactive question answering, the ability of the system to
produce correct answers to queries is the relevant capability to be measured. Correct answers are
assumed to reflect the correct processing of all relevant aspects of the input, and incorrect answers
reflect the failure to handle some relevant aspect of the query, whether in speech recognition or
subsequent NL processing. For this application, wrong answers were considered twice as bad as no
answer at all, hence the overall scoring metric: Score = Right - Wrong.

Because the ATIS systems are intended to function eventually in an interactive setting, it may be
considered helpful to a user to provide information that is not explicitly requested. Therefore the
scoring program has been quite tolerant of the inclusion of additional database fields not explicitly
requested in the answer, even though this tolerance could conceal a system's inability to determine
exactly what it was that the user requested. Recently this tolerance has come into question and
a stricter scoring methodology is being developed which will penalize a system for exceeding the
maximum answer for a query.
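
The basic ATIS metric (Right minus Wrong) can be sketched in a few lines of Python. The function
name and the outcome labels are assumptions made for this illustration; this is not the official
ATIS scoring program, and it does not model the tolerance for extra database fields discussed above.

```python
from collections import Counter


def atis_score(outcomes):
    """Return Right - Wrong for a list of per-query outcomes.

    Each outcome is one of "right", "wrong", or "no_answer" (hypothetical
    labels). An unanswered query adds nothing to the score, while a wrong
    answer subtracts one, so answering wrongly costs more than declining.
    """
    counts = Counter(outcomes)
    return counts["right"] - counts["wrong"]


outcomes = ["right", "right", "wrong", "no_answer", "right"]
print(atis_score(outcomes))  # 3 right - 1 wrong = 2
```

Relative to a right answer (+1), declining to answer loses one point and a wrong answer loses two,
which is the sense in which a wrong answer is twice as bad as no answer.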

The MUC task, on the other hand, involves not question answering, but data extraction. The
systems were evaluated on their ability to correctly recover 18 different types of information from
each text, and therefore the score involves measures of recall, the overall percentage of correct fills,
and precision, the percentage of correct fills among those offered. The measurement of precision
was further decomposed into measurements of overgeneration, which is the percentage of spurious
(false positive) slot fills of all fills offered, and fallout, which pertains only to those slots with a finite
number of possible values, and is a measure of the tendency of a system to make errors as the number
of possible options increases.

In addition to scoring performance for each of the 18 template slots in the MUC-3 task, the collective
performance over all the slots was calculated. However, a known flaw with the MUC-3 evaluation
metrics is that there are both logical and statistical dependencies among fillers for various slots, and
this confounds methods that tend to treat the filling of any particular slot as an independent event.

Like the ATIS task, the MUC task scoring procedures allow some flexibility in accepting “additional”
information in that answers may specify “optional” fills that do not contribute either to recall
or overgeneration scores. Because some template slots are filled with strings from the text, and the
criteria for determining the correct string fill are incomplete, the participants in MUC-3 were allowed
to score inexact matches as partially correct, which counted as half of a correct answer.
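
As a rough illustration of how these measures relate, the sketch below computes recall, precision,
and overgeneration from slot-fill counts, with each partial match counted as half of a correct fill.
The function and argument names are assumptions made for this example; it is not the MUC-3 scoring
program. Fallout is omitted because it also requires the number of possible incorrect values per slot.

```python
def muc_scores(correct, partial, possible, actual, spurious):
    """Compute recall, precision, and overgeneration as fractions.

    correct  -- fills that match the answer key
    partial  -- inexact matches, each counted as half a correct fill
    possible -- fills in the answer key (optional fills excluded)
    actual   -- fills the system offered
    spurious -- offered fills with no counterpart in the key
    """
    credit = correct + 0.5 * partial
    recall = credit / possible if possible else 0.0
    precision = credit / actual if actual else 0.0
    overgeneration = spurious / actual if actual else 0.0
    return recall, precision, overgeneration


recall, precision, overgen = muc_scores(
    correct=40, partial=10, possible=100, actual=60, spurious=6
)
print(f"recall={recall:.2f} precision={precision:.2f} overgeneration={overgen:.2f}")
```

With these made-up counts the system earns credit for 45 fills, giving a recall of 0.45 against the
100 fills in the key and a precision of 0.75 against the 60 fills it offered.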

The end-user application of the MUC task is less clear than that for the ATIS task, and so it is
much less clear how to accurately characterize a system's performance with a single number. Different
tradeoffs between recall, precision, and computation may be appropriate for different applications,
and therefore there are no clear criteria for the comparison of systems that choose a different tradeoff
point between recall and precision. Although most developers have tended to prefer a strategy
that emphasizes precision at the expense of recall, the stipulated goal of MUC-3 was a balanced
maximization of both parameters.

Neither  MUC  nor  AXIS  has  yet  used  statistical  significance  tests  on  scores.  Statistical  significance 
tests  would  be  helpful  because  they  would  tell  us  both  when  apparently  small  differences  in  fact  do 
reflect  reliable  differences  between  system  performance,  and  when  apparently  large  differences  do  not 
reflect  reliable  differences  between  system  performance. 


7  Benefits 


With these formal evaluation programs, we are beginning to be able to compare natural language
processing techniques in an objective, quantitative way. However, because spoken language systems
and text processing systems are enormously complex, the single number or small set of numbers which
represent a system’s performance is a very coarse measure of the value of the various algorithms
which make up the system. For this reason we believe it is important to perform parametric
intra-system comparisons. By this we mean that scores are presented from a single system using different
components. For example, at the most recent ATIS evaluation, Unisys performed three tests, using
the same natural language system with three different speech recognizers ([2]). This provided a clean,
controlled comparison of the performance of the speech recognizers. Similarly, CMU presented scores
of the performance of their system with and without a knowledge-based module ([6]).

Figure 3: Desirable properties of natural language systems and evaluations

8  Other  Evaluations 

Both MUC and ATIS are aimed at comparing basic language understanding capabilities across systems.
Other properties of natural language understanding systems are important as well, and it is
important not to lose sight of these as we become more successful in executing formal evaluations like
MUC and ATIS. For example, we want to know how systems compare on usability and portability,
as well as how their different components compare, in addition to the basic natural language
understanding. Evaluating just what we know how to evaluate is like looking for the keys under the
lamppost. We don’t want to look for the keys just under the lamppost; we want to turn on more
lights. As Figure 3 suggests, there are several lights that we may want to turn on.


9  Conclusions 

By  participating  in  these  evaluations,  we  have  learned  that: 

• The evaluation infrastructure is complex and building it requires active participation from the
sites being evaluated.

• The cost of evaluation is high in terms of both amount of time spent on the evaluation and
diversion of researchers’ energies from other activities. As natural language systems become
more portable, these costs may be reduced, since less time will be spent on routine porting.

•  The  scientific  value  of  these  evaluations  is  tremendous,  but  we  need  to  continue  exploring  other 
forms  of  evaluation  in  order  to  get  an  evaluation  of  other  desirable  properties  of  natural  language 
systems. 


Multi-site evaluation of natural language processing systems as in the ATIS and MUC efforts has
been extremely stimulating to the field of natural language processing in the four years since the
first MUC evaluation. At the same time an enormous amount of effort has gone into defining the
requirements and procedures for carrying out the evaluations. In this paper we have tried to document
the requirements of successful multi-site black box evaluations from our perspective as participants
in both the MUC and ATIS evaluations so that future evaluations can build on these experiences.


1.  Introduction

The evaluation of natural language processing (NLP) systems has become an issue of increasing
concern within the computational linguistics community and the producers and
consumers of NLP products. Evaluation of NLP systems is essential in order to measure
capabilities and track improvement in individual systems, compare different systems, measure
technical progress and growth in the field, and provide a basis for selecting NLP systems to
best fit the communication requirements of applications and applications systems.

The  objectives  of  the  Benchmark  Investigation/Identification  (Benchmark  I/I)  Project  are 
designed  to  support  these  evaluation  activities.  As  part  of  the  Benchmark  I/I  Program,  we 
are  developing  a  method  and  procedure  for  evaluating  NLP  systems  that: 

•  produces  profiles  of  NLP  systems  that  are: 

- descriptive: the profiles provide descriptive information with regard to the types
of linguistic phenomena on which the NLP system succeeded or failed, not just one or
two numerical scores (e.g., recall and precision) that provide no detailed analysis.

- hierarchically organized: the capabilities of NLP systems are described by individual
capability as well as by class of capability, at the various levels of granularity
provided by the hierarchical structure of the profile.

- quantitative: scores assigned by evaluators to individual test items are aggregated
by class, and weighted averages are used to calculate a numerical performance
score for each class in the hierarchy.

- objective: test items are defined in a detailed manner to remove evaluator subjectivity.

This research is supported by Rome Laboratory under Contract No. F30602-90-C-0034.

• is usable across application domains.

• is applicable across different types of NLP systems such as database query NL front-end
systems, text/message processing systems, and interactive NL dialogue interfaces.

• does not require an NLP system to be modified, re-engineered, or re-implemented to
adapt it to a particular text corpus or domain. This unique feature sets the Benchmark
I/I approach apart from others. Other evaluation efforts (e.g., the MUC evaluations)
provide a domain or text corpus to which NLP systems must be adapted or ported
even though this porting may be very costly.

• is repeatable; the Procedure produces consistent results, independent of the evaluator.

• does not require that the evaluator be a trained linguist.

• is unbiased with respect to linguistic theories, system-internal processing methods, and
knowledge representation techniques.

The Benchmark I/I Evaluation Procedure is being designed to produce comprehensive descriptive
evaluation profiles for NLP systems. Such a comprehensive profile should be interpreted
in terms of the application requirements for which the NLP system will be used.
Since the Benchmark I/I Procedure is being designed to be comprehensive and to be applicable
across different application domains and different NLP system types, one would not
necessarily expect a particular type of NLP system to excel in all areas. For example, a text
processing system that performs well at information extraction to update a database may
not process NL queries or commands. On the other hand, a database query NL front-end
system may perform extremely well at processing NL queries and commands, but may not
process declarative sentences.

This paper discusses the content and structure of the Benchmark I/I Evaluation Procedure
and the results of assessing the Procedure in the context of applying it to each of three
NLP systems by each of three Interface Technologists at the end of the first six-month phase
of the project. Section 2 discusses background and scope. Section 3 briefly presents an
overview of the Benchmark I/I Project. Section 4 presents the design principle underlying
the Evaluation Procedure and its organization. Section 5 describes the Procedure’s content
and scoring method and Section 6 discusses the profiles produced by the Procedure. Section 7
reports on the assessment of the Evaluation Procedure. Section 8 presents the current status
of the project and future directions. Section 9 summarizes the important issues discussed in
this paper and Section 10 provides references. The appendix includes a profile of an NLP
system produced by the Benchmark I/I Procedure.

As discussed in Section 3, this paper reports on the results of only the first six-month phase
of an 18-month project. The reader should therefore keep in mind that although the Procedure
is described and results are reported from the application of the Procedure for assessment
purposes, this work is not complete.


2.  Background 

There are many different areas and issues for which NLP systems need to be evaluated. Table
1 categorizes and lists many of these issues. The problems in evaluating NLP systems are
difficult and many approaches to these different issues have been discussed [Bates90], [BBN88],
[Biermann83], [Flickinger87], [Guida86], [Hayes-Roth89], [Hendrix76], [Hershman79], [Hix91],
[Kohoutek84], [Lazzara90], [Malhotra75], [Mitta91], [Ogden88], [Palmer89], [Read88],
[Sundheim91], [Tennant79], [Weischedel86]. The Benchmark Evaluation Procedure focuses on the
linguistic issues listed in the first column of the table. The following paragraphs briefly
review some of the related evaluation efforts and approaches.


Table 1: Categories of Evaluation Issues

Linguistic     Intelligent Behavior       End User        System Development
Issues         & Reasoning Issues         Issues          Issues
-----------    -----------------------    ------------    -------------------
lexicon        inference                  habitability    extensibility
syntax         learning                   reliability     quality of tools
semantics      cooperative dialogue       likability      cost
discourse      speaker/hearer modeling    efficiency      ease of development
pragmatics     real world knowledge                       maintainability
                                                          portability
                                                          integrability

Several studies have focused on the issue of habitability. In laboratory evaluations, Hershman,
Kelly, and Miller [Hershman79] studied ten Navy officers using LADDER, a natural
language query system designed to provide easy access to a naval database. The study simulated
the actual operational environment in which LADDER would be used and the subjects
were trained to the database and LADDER interface. The results of the study indicated
that the extensive training given to the subjects was adequate for training the functional and
conceptual coverage of the system, but not for training the syntactic and lexical coverage.

Focusing on habitability and efficiency, Biermann, Ballard, and Sigmon [Biermann83] designed
an experiment that was concerned with the usefulness of English as a programming
language. Their experiment used a natural language programming system, called NLC, that
allows a user to display and manipulate tables and matrices while at a display terminal.
All user inputs were expressed in English. The results of the study indicated that, with
relatively little training on NLC, subjects were able to type system-acceptable syntax with
a high enough success rate to obtain correct answers in a reasonable amount of time.

The Performance Evaluation Plan (PEP) [Lazzara90] addresses the evaluation of NLP tools
or shells for developing specific NLP applications from a user-oriented perspective, where
three classes of users are identified: systems developers, end users, and systems maintainers.
As a result, the PEP provides a methodology for evaluating issues such as integrity,
maintainability, extendability, portability, user productivity and likability.

Hayes-Roth [Hayes-Roth89] and Mitta [Mitta91] are concerned with evaluation of knowledge
systems and expert systems. Hayes-Roth [Hayes-Roth89] is primarily concerned with extrinsic
issues such as advice quality, reasoning correctness, robustness, and solution efficiency,
and intrinsic issues such as elegance of knowledge base design, modularity, and architecture.
Mitta [Mitta91] discusses a methodology for evaluating an expert system’s usability, based
on the following six variables or measures: user confidence that the solution is correct, user
perception of difficulty, correctness of solution, number of responses required of users, inability
of expert system to provide a solution, and rate of help requests. Although focusing on
knowledge systems or expert systems, these discussions and methodologies are applicable to
NLP systems because NLP systems are special types of knowledge systems.

Several approaches and studies focus on linguistic and NL understanding capabilities. Guida
and Mauri [Guida86] have developed a formal and detailed method for evaluating NLP
systems. They treat an NLP system as a function from a set of input expressions to one
or more sets of outputs. Their method requires a measure of error, defined to compare
the closeness of the output with the correct output, and a measure of the importance of
each input. Their method of evaluation computes the sum of the errors weighted by the
importance of the input.
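
The weighted sum just described can be sketched as follows. This is only a minimal reading of the
one-sentence summary above, not Guida and Mauri's actual formalism; the function names, the toy
system, and the exact-match error measure are all assumptions made for illustration.

```python
def weighted_error(system, test_set, error):
    """Sum the per-input errors, each weighted by the input's importance.

    system   -- callable mapping an input expression to an output
    test_set -- iterable of (input, correct_output, importance) triples
    error    -- callable giving the distance between an output and the
                correct output (0 means identical)
    """
    return sum(
        importance * error(system(expr), expected)
        for expr, expected, importance in test_set
    )


def exact_match_error(output, expected):
    """Toy error measure: 0 for an exact match, 1 otherwise."""
    return 0 if output == expected else 1


# Toy "system" that upper-cases its input; only the second case fails.
toy_tests = [("abc", "ABC", 2.0), ("de", "dE", 1.0), ("f", "F", 0.5)]
print(weighted_error(str.upper, toy_tests, exact_match_error))  # 1.0
```

A lower total indicates a better system under this reading, and an error on an important input
costs more than the same error on an unimportant one.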

Several approaches that focus on linguistic capabilities have entailed the development of test
corpora for evaluating NL database query interfaces [BBN88], [Hendrix76], [Malhotra75], and
[Flickinger87]. Flickinger, Nerbonne, Sag, and Wasow [Flickinger87] developed a test suite
of English sentences, annotated by construction type, that covers a wide variety of syntactic
and semantic phenomena. The test suite reflects grammatical issues with which linguists
have been concerned for a considerable length of time. Anomalous strings are included as
well as well-formed sentences.

As part of the Artificial Intelligence Measurement System (AIMS) project [Read88], evaluation
criteria and methods for describing linguistic coverage are being developed for NLP
systems. As a result, a Sourcebook [Read88] is being developed that consists of a database
of “exemplars” of representative problems in NL processing. Each exemplar includes a piece
of illustrative text, description of the linguistic/conceptual issue at stake, discussion of the
problems in understanding the text, and references to more extensive discussion in the
literature.

The Naval Ocean Systems Center (NOSC) completed the third evaluation of English text
processing systems in May, 1991, with the Third Message Understanding Conference (MUC-3)
[Sundheim91]. These evaluations focused on the performance of text analysis systems on
an information extraction task. The training corpus consisted of 1300 texts with an overall
size of over 2.5 megabytes. The task was to extract information on terrorist incidents from
relevant text. At the end of each phase of MUC-3, participating systems were required to
extract information from a test corpus of 100 previously unseen texts. Scoring results for both
phases, examples of the development and test corpora, and descriptions of the participating
systems are given in the proceedings [Sundheim91].

Finally, an important issue is the reliability of evaluation methods. That is, different evaluators
must produce consistent results when applying the same evaluation method to the same
target system. Although not directly concerned with NL processing, the approach of Hix
and Schulman [Hix91] for testing the reliability of their methodology for evaluating
human-computer interface development tools is relevant. To empirically test their methodology,
Hix and Schulman had six evaluators each apply the method to two (out of a total three)
application tools, so that each tool was evaluated by four different participants. To produce
statistical tests of reliability, the researchers computed the probability that responses from
the four evaluators for each tool would match by chance. The observed proportion of matches
for each category of items was compared with the chance probability using a binomial test.
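
A comparison of this kind can be sketched as a one-sided binomial test. The text does not say
how Hix and Schulman derived the chance probability, so the version below assumes that each
evaluator picks uniformly and independently among a fixed number of response options; that
assumption, the function names, and the example counts are all illustrative.

```python
from math import comb


def chance_agreement(num_options, num_evaluators):
    """Probability that all evaluators give the same response by chance,
    assuming each picks uniformly and independently (an assumption)."""
    return num_options * (1.0 / num_options) ** num_evaluators


def binomial_upper_tail(matches, items, p_chance):
    """P(X >= matches) for X ~ Binomial(items, p_chance)."""
    return sum(
        comb(items, k) * p_chance**k * (1 - p_chance) ** (items - k)
        for k in range(matches, items + 1)
    )


# Four evaluators, two response options: 2 * (1/2)**4 = 0.125.
p = chance_agreement(num_options=2, num_evaluators=4)
# Suppose all four evaluators matched on 8 of 10 items in a category.
p_value = binomial_upper_tail(matches=8, items=10, p_chance=p)
print(p, p_value)
```

A small tail probability means the observed number of matches would be unlikely if the evaluators
were agreeing only by chance, which is evidence that the method is being applied consistently.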

The Benchmark I/I evaluation method focuses on the linguistic capabilities of NLP systems.
Important components of the Benchmark I/I Procedure are the hierarchically structured
classification scheme for linguistic phenomena, emphasis on descriptions of the linguistic
phenomena covered in the Procedure, and examples illustrating linguistic phenomena. In
these aspects, the Benchmark I/I method has some similarity to the Sourcebook approach
of Read et al. [Read88]. The Benchmark I/I method also provides a Procedure for testing
whether NLP systems are capable of handling the described linguistic phenomena. The
Procedure includes patterns, instructions, and illustrative examples for composing NL text for
testing purposes and a Profile generator that produces descriptive profiles of NLP systems
organized according to the hierarchically structured classification scheme for linguistic
phenomena. We are also designing a reliability test for the Benchmark I/I Evaluation Procedure
that is somewhat similar to that of Hix and Schulman [Hix91]. In contrast to most of the
approaches discussed above, with the exception of the Sourcebook [Read88], the Benchmark
I/I evaluation tool is being designed to be applicable to different types of NLP systems and
across application domains.

The following section provides an overview of the Benchmark Project. Subsequent sections
discuss the design and content of the Benchmark I/I Evaluation Procedure, evaluation
Profiles, experience in applying the Procedure, and assessment of the Procedure.

3.  The  Benchmark  Investigation/Identification  Project 

The  Benchmark  Project  is  an  eighteen-month  project  that  includes  the  following  key  tasks: 

• development of an Evaluation Procedure that produces profiles of NLP systems consisting
of hierarchically organized, quantitative, objective descriptions of the systems’
capabilities. Supporting this task is the development of:

- a database of non-subjective Descriptive Terminology for describing NLP capabilities
outside the context of their application to target software.

- a Classification Scheme for NLP capabilities and issues that provides the hierarchical
organization.

- a Procedure to guide the evaluator through the evaluation process. This Procedure
assists the evaluator in developing test sentences and provides for the
recording of results/scores.

• assessment of the Evaluation Procedure at the end of each of the three six-month
development phases. This assessment activity consists of having Interface Technologists,
who have had no involvement with the development of the Evaluation Procedure, apply
the Procedure to several actual NLP systems. More specifically, each of three Interface
Technologists applies the Procedure to each of three NLP systems at each of the three
milestones shown in Figure 1. The results of the assessment by the Interface Technologists
provide feedback to the developers of the Procedure during its incremental
development.

Figure 1: Assessment Milestones

These  issues  are  discussed  in  more  detail  in  the  following  sections. 

4.  Design Principle and Procedure Structure

Design Principle

The design of the Evaluation Procedure is based on the principle of having each NLP
capability, C, tested in at least one Procedure item that includes no other “intruding” NLP
capability that might obscure the system’s performance on the focal capability C for the
particular test item. To accomplish this type of design for the Procedure, the Procedure is
being developed so that, for untested individual NLP capabilities, each Procedure item tests
just one untested NLP capability at a time, to the extent possible; combinations are tested
after individual capabilities are tested. Thus, the Procedure is being designed to progress
from very elementary sentence types containing simple constituents to more complex
sentence (group) types. The idea is that each time a test sentence (group) is presented to the
NLP system being evaluated, the sentence (group) should contain only one new (untested)
linguistic capability or one new untested combination of tested capabilities. The other
capabilities required for processing the input should already have been tested and the NLP
system should already have succeeded on these other issues. The Procedure must avoid
the situation in which tests for several capabilities are always combined in the same test
sentences, since the Procedure would then be insensitive to the individual capabilities. For
example, a test of ellipsis only in the context of question-answering dialogue would not be
usable with a system that is not designed to handle questions (e.g., a text understanding
system designed for an information extraction task, which typically processes declarative
sentences, but not interrogatives or imperatives).
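
The ordering constraint in this principle can be sketched as a simple greedy schedule. This is an
illustration only, not the project's own tooling: the function name, the item names, and the
capability labels are hypothetical, and the sketch does not model the further requirement that
the system has already succeeded on the previously tested capabilities.

```python
def order_test_items(items):
    """Order test items so each adds at most one untested capability.

    items -- dict mapping an item name to the set of capabilities that
             its test sentence exercises (names are hypothetical)
    Returns (ordered_names, unscheduled_names). An item is scheduled when
    at most one of its capabilities is still untested; items that always
    need two or more untested capabilities at once stay unscheduled.
    """
    tested = set()
    remaining = dict(items)
    ordered = []
    progress = True
    while remaining and progress:
        progress = False
        # Prefer items with fewer capabilities, so elementary sentences come first.
        for name in sorted(remaining, key=lambda n: (len(remaining[n]), n)):
            if len(remaining[name] - tested) <= 1:
                ordered.append(name)
                tested |= remaining.pop(name)
                progress = True
                break
    return ordered, sorted(remaining)


items = {
    "declarative": {"declarative"},
    "relative_clause": {"declarative", "relative_clause"},
    "ellipsis_in_question": {"question", "ellipsis"},
    "question": {"question"},
}
print(order_test_items(items))
```

Any item reported as unscheduled points to a gap in the Procedure: one of its capabilities is never
tested in isolation, which is exactly the situation the design principle is meant to avoid.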

Descriptive  Terminology 

In  any  evaluation  effort,  it  is  important  to  identify  and  define  the  evaluation  criteria.  The 
Benchmark  project  focuses  on  linguistic  capabilities;  that  is,  the  ability  of  NLP  systems  to 
process  the  various  constructs  and  phenomena  of  natural  language.  As  part  of  this  project, 
we  are  developing  a  database  of  Descriptive  Terminology  to  describe  the  language  constructs 
and  features  for  which  NLP  systems  are  tested  in  the  Benchmark  Evaluation  Procedure. 
This  Terminology  is  being  developed  from  the  literature  on  linguistics  and  computational 
linguistics.  Definitions  are  based  on,  or  selected  from,  well-respected  literature  sources.  This 
terminology  is  used  throughout  the  Procedure  both  to  identify  what  is  being  evaluated  in 
each  item  and  for  the  system  Profiles  produced  by  the  Procedure. 

Classification  Scheme 

The Benchmark Evaluation Procedure is being developed to produce descriptive profiles of
NLP system capabilities, displayable at several levels of granularity or detail. A classification
scheme is being developed in the form of a hierarchical structure to provide various
levels of granularity. Each class of the scheme is representative of a subset of NLP issues or
capabilities. The current top level of the hierarchy includes the following classes or types of
linguistic phenomena: basic sentence types, simple verb phrases, noun phrases, quantifiers,
simple adverbials, comparatives, connectives, and embedded sentences. This classification
scheme is not complete nor fixed, of course, since this paper reports on the status after
only the first six-month phase. Each of the classes mentioned comprises sub-classes or
subtypes of linguistic phenomena. The bottom level of the classification scheme consists of
the individual capabilities. As the project continues, issues that are being added to the
classification scheme and evaluation procedure include: different verb types, tense and
aspect, verb phrases, reference (including anaphoric and cataphoric), ellipsis, and semantics of
events. The classification scheme is not being limited by the current state of the art in NLP
capabilities, but is being developed to include generic capabilities of human-to-human
communication in natural language that could be applicable to human-machine communication
in the future.
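
One way to picture how such a hierarchy yields a score at every level of granularity is a
recursive weighted average over the classification tree. The sketch below is an assumption about
how that aggregation might look, not the project's Profile generator; the sub-class names, the
weights, and the scores are all made up for illustration.

```python
def node_score(node):
    """Score a node: a leaf carries its own score, while a class is the
    weighted average of the scores of its children."""
    if "children" not in node:
        return node["score"]
    children = node["children"].values()
    total_weight = sum(child["weight"] for child in children)
    weighted_sum = sum(child["weight"] * node_score(child) for child in children)
    return weighted_sum / total_weight


# Hypothetical fragment of a profile: two top-level classes, one of
# which has two individual capabilities beneath it.
profile = {
    "children": {
        "noun phrases": {
            "weight": 2.0,
            "children": {
                "relative clauses": {"weight": 1.0, "score": 0.5},
                "possessives": {"weight": 1.0, "score": 1.0},
            },
        },
        "quantifiers": {"weight": 1.0, "score": 0.0},
    }
}

print(node_score(profile["children"]["noun phrases"]))  # (0.5 + 1.0) / 2 = 0.75
print(node_score(profile))  # (2.0 * 0.75 + 1.0 * 0.0) / 3.0 = 0.5
```

Reading the tree at the top level gives one coarse figure per class, while descending to the
leaves recovers the individual capabilities on which the system succeeded or failed.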

5.  The Evaluation Procedure

The  Procedure  is  being  designed  to  be  domain  independent.  Therefore,  the  Procedure  does 
not  include  nor  rely  on  a  particular  corpus  of  natural  language  text  or  sentences.  Instead, 
the  test  sentences  or  paragraphs  to  be  processed  by  the  NLP  system  are  composed  by  the 
evaluator  either  during,  or  prior  to,  the  administration  of  the  Evaluation  Procedure.  The 
Procedure  is  designed  to  assist  the  evaluator  with  the  creation,  modification,  or  tailoring  of 
test  sentences. 

Since the Procedure is being designed for use by people who are not well versed in
linguistics, each Procedure test item includes explanatory material that is intended to provide
sufficient instruction to enable the evaluator to compose legal test sentences, and to score
the performance of NLP systems on these test items.

As stated previously, the Procedure is being designed to progress from very elementary
sentence types containing simple constituents to more complex sentence (group) types. The
Procedure is being developed so that, for untested individual NLP capabilities, each Procedure
item tests just one NLP capability at a time, to the extent possible, and combinations
are tested after the individual capabilities are tested. Each Procedure item consists of the
following components:

•  A  brief  explanation  and  definition  of  the  linguistic  capability  being  tested,  along  with 
any  special  instructions  for  testing.  This  is  particularly  important  for  evaluators  with 
no  linguistic  background. 

•  Patterns  that  define  the  structure  and  features  of  the  test  sentences  to  be  composed 
and  input  to  the  NLP  system  under  evaluation.  The  patterns  may  include  non-terminal 
words  from  closed  classes  (e.g.,  prepositions,  connectives).  The  domain-specific  words 
of the test sentences are supplied by the evaluator, appropriate to the particular application
for which the NLP system has been installed and with which it executes.

•  Example sentences to aid the evaluator in composing test sentences.

•  A box for the evaluator’s test sentences.

•  A box for the evaluator’s score.

Appendix  A  includes  an  abbreviated  excerpt  from  the  Procedure  section  on  relative  clauses. 
Rows of asterisks mark the places where material has been omitted. However abbreviated,
the example does provide a sample of the type of explanatory material, instructions, examples,
and recording provisions that are included in the Procedure.

For each test item in the Procedure, the evaluator submits a NL input to the NLP system
being evaluated and determines whether or not the response indicates that the system
processed the input correctly and understood the input. The evaluator has four choices of
scores to award to the system for each test item, as listed below. The Procedure allows for
a score to be split between the score types, enabling the evaluator to indicate confidence in
the system’s correct processing of the input. The total score for one test item should be 1.0.

•  Success: The system successfully processed the NL input and indicated by its response
that it understood the input.

•  Failure: The system failed to understand the NL input.

•  Indeterminate: The evaluator was unable to determine whether the system understood
the NL input, even after trying more than once to test the particular NL construct or
capability.

•  Unable to compose NL input: The evaluator was unable to compose a NL test input
for  the  Procedure  item.  This  problem  can  arise  if  the  language  handled  by  the  NLP 
system  being  evaluated  is  so  restricted  that  the  words  or  phrase  types  necessary  to 
compose  test  sentences  for  the  Procedure  item  are  lacking.  An  example  would  be  the 
attempt  to  evaluate  a  system’s  ability  to  handle  quantification  without  having  any  of 
the  quantifiers  in  the  lexicon  of  the  system  being  evaluated. 
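
The split scoring described above can be sketched as a small record type. This is a minimal illustration, assuming that the four parts of an item's score are non-negative and sum to 1.0; the class name `ItemScore` and its field names are hypothetical and are not taken from the Procedure itself.

```python
from dataclasses import dataclass


@dataclass
class ItemScore:
    """Score for one test item, split across the four score types (hypothetical names)."""
    success: float = 0.0
    failure: float = 0.0
    indeterminate: float = 0.0
    unable_to_compose: float = 0.0

    def __post_init__(self):
        parts = (self.success, self.failure, self.indeterminate, self.unable_to_compose)
        # Assumption: the parts are non-negative and together account for exactly one item.
        if any(p < 0 for p in parts) or abs(sum(parts) - 1.0) > 1e-9:
            raise ValueError("the four parts must be non-negative and sum to 1.0")


# An evaluator who is fairly, but not fully, confident that the system understood the input:
mostly_success = ItemScore(success=0.7, indeterminate=0.3)
print(mostly_success)
```

Splitting a score this way lets an evaluator record partial confidence without inventing a fifth category.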

6.  Evaluation Profiles

The  Evaluation  Procedure  is  designed  to  produce  descriptive  profiles  of  NLP  systems.  The 
profiles  are  hierarchically  organized  according  to  the  classification  scheme  discussed  in  Section 
4.  The  profiles  can  be  viewed  or  examined  at  any  level  of  granularity  (levels  of  granularity 
corresponding  to  the  hierarchy  levels).  At  the  bottom  level  of  the  hierarchy  are  the  individual 
NLP  capabilities.  At  any  level  other  than  the  bottom  level,  the  scores  of  the  lower  level  items 
or  classes  are  combined  in  a  weighted  average  to  produce  the  score  for  the  parent  class  or 
category.  The  weights  are  not  fixed,  but  may  be  specified  by  the  evaluator.  They  should 
remain constant when using the Procedure to compare different systems. Figure 2 shows a
sample system profile consisting of only the top level of the hierarchy. Appendix B includes
a NLP system profile that shows the top three levels of results.

Figure 2: A System Profile: Top Level Only
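
The weighted averaging up the hierarchy can be sketched as a short recursive function. This is only an illustration under stated assumptions: the hierarchy is represented as nested dictionaries whose leaves are item scores between 0 and 1, weights are looked up by class name and default to 1.0, and the class names, item names, scores, and weights shown are invented for the example.

```python
def profile_score(node, weights=None):
    """Return the score of a class as the weighted average of its children.

    node    -- a number (score of an individual item) or a dict name -> child node
    weights -- optional dict name -> weight chosen by the evaluator (default 1.0)
    """
    weights = weights or {}
    if not isinstance(node, dict):
        return float(node)
    total_weight = sum(weights.get(name, 1.0) for name in node)
    weighted_sum = sum(
        weights.get(name, 1.0) * profile_score(child, weights)
        for name, child in node.items()
    )
    return weighted_sum / total_weight


# Invented two-level hierarchy with invented item scores.
hierarchy = {
    "Class A": {"item 1": 1.0, "item 2": 0.5},
    "Class B": {"item 3": 0.0, "item 4": 1.0},
}
# Invented weights; they would be held constant when comparing different systems.
weights = {"Class A": 3.0, "Class B": 1.0}

top_level = profile_score(hierarchy, weights)
print(top_level)
```

Because the same function applies at every level, a profile can be reported at any level of granularity by calling it on the corresponding subtree.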


7.  Assessment  of  the  Evaluation  Procedure 

As  stated  in  Section  1,  the  objectives  of  the  Benchmark  Project  specify  that  the  Evaluation 
Procedure  incorporate  the  following  important  features:  domain  independence,  consistency 
across evaluators, applicability across NLP system types, usability without requiring modification
or re-engineering of the NLP system being evaluated, and usability by non-linguists.

The  assessment  task  is  designed  to  determine  whether  these  objectives  are  being  met  and 
to provide feedback to the developers to improve and refine the Procedure during its development.
As part of the assessment of the Evaluation Procedure, the Procedure is scheduled
to  be  applied  to  three  different  NLP  systems  by  each  of  three  evaluators,  called  Interface 
Technologists,  at  the  end  of  each  of  the  three  six-month  phases  of  the  project.  The  following 
are  some  of  the  important  features  of  the  design  of  the  assessment  task: 

•  To  achieve  an  impartial  assessment  of  the  Procedure,  the  Interface  Technologists  have 
had  no  involvement  in  the  development  of  the  Procedure. 

•  To  ensure  that  the  Procedure  is  not  biased  with  regard  to  a  particular  type  of  NLP 
system,  a  variety  of  NLP  system  types  are  scheduled  to  be  used  in  the  assessment  task. 
As  a  minimum,  database  front-end  NL  query  systems  and  text  understanding  systems 
are scheduled for use. The NLP systems selected are a mix of commercially available
systems  and  advanced  research  products.  Additional  types  of  systems  will  be  used 
during  assessment  of  the  Procedure,  depending  on  the  availability  and  cost  of  using 
other  systems. 

•  To minimize bias with regard to particular NLP systems, at least one of the NLP
systems used at each assessment milestone is scheduled to be a system that has not
previously been used in the assessment activities.

•  To  ensure  that  the  Procedure  can  be  used  with  different  application  domains,  NLP 
systems  with  at  least  two  different  domains  are  scheduled  for  use  in  the  assessment 
activities. 

•  To minimize bias caused by order of Procedure application by the Interface Technologists,
a Latin square design is being used for the Procedure applications. This Latin
square design is illustrated in Table 2.


Table  2:  Latin  Square  Design  for  the  Assessment  Task 
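
A 3 x 3 Latin square for a design of this kind can be generated cyclically. The sketch below is illustrative only: the technologist and system labels are placeholders, and the cyclic construction is an assumption rather than the specific arrangement used in the assessment.

```python
def latin_square(n):
    """Return an n x n cyclic Latin square: the cell in row r, column c holds (r + c) mod n."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]


technologists = ["Technologist 1", "Technologist 2", "Technologist 3"]  # placeholder labels
systems = ["System A", "System B", "System C"]                          # placeholder labels

square = latin_square(3)
# Each row gives the order in which one technologist applies the Procedure to the systems.
schedule = {
    tech: [systems[i] for i in row]
    for tech, row in zip(technologists, square)
}
for tech, order in schedule.items():
    print(tech, "->", ", ".join(order))
```

Every system appears exactly once in each row and once in each column, so each technologist evaluates every system and each system is evaluated once in each ordinal position.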

As part of the assessment task at each of the three six-month milestones, the Evaluation
Procedure  is  being  evaluated  using  several  techniques: 

•  Statistically: Data generated by the nine applications of the Procedure are analyzed
statistically,  even  though  the  number  of  subjects  is  small. 

•  Critique during use: The Interface Technologists record problems, criticisms, and suggestions
regarding individual Procedure items during the use of the Procedure.

•  Error Analysis: The developers of the Evaluation Procedure examine the Procedure
books completed by each Interface Technologist for each NLP system and examine any
errors made by the Technologists, particularly in composing test sentences during their
use of the Procedure. An item analysis is performed to identify which items caused
problems across Interface Technologists. The errors are investigated to determine the
nature and cause of the errors.

•  Questionnaire: The Interface Technologists complete an Assessment Questionnaire
developed to evaluate the Procedure.

Because of limited space in this paper, we have not included detailed results of all the
assessment techniques. They are, however, available from the authors. In assessing the
consistency of the Evaluation Procedure across Interface Technologists, the scores for each
NLP system were compared across Interface Technologists. The graph of Figure 3 shows
the score for each major category for each Interface Technologist.

Figure 3: Major Category Scores Across Interface Technologists for System #3

As the graph shows, the scores were very consistent across Interface Technologists except
for the section on Adverbs. This aberration was due to the small size of the section on
adverbs, just three items. Furthermore, adverbs were not common in the vocabulary of the
NLP systems used. One of the Interface Technologists was able to use some adverb(s) that
the system understood, while the other Technologists were not able to do so.
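
One simple way to compare category scores across evaluators is to compute, for each category, the spread between the highest and lowest score awarded. The sketch below is a hypothetical illustration, not the statistical analysis used in the project: the evaluator labels, category names, scores, and the 0.3 threshold are all invented, and it assumes every evaluator scored every category.

```python
def spread_by_category(scores):
    """scores: evaluator -> {category: score}. Returns category -> (max - min) across evaluators."""
    categories = set().union(*(by_cat.keys() for by_cat in scores.values()))
    return {
        cat: max(by_cat[cat] for by_cat in scores.values())
        - min(by_cat[cat] for by_cat in scores.values())
        for cat in categories
    }


# Invented scores for one system, used only to exercise the function.
scores = {
    "Evaluator 1": {"Category X": 0.0, "Category Y": 0.75},
    "Evaluator 2": {"Category X": 0.5, "Category Y": 0.75},
    "Evaluator 3": {"Category X": 0.0, "Category Y": 0.5},
}

spread = spread_by_category(scores)
# Flag categories whose spread exceeds an arbitrary, illustrative threshold.
inconsistent = sorted(cat for cat, value in spread.items() if value > 0.3)
print(spread, inconsistent)
```

A category with very few items will tend to show a large spread, since a single item scored differently moves the category average substantially.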


8.  Current  Status  and  Future  Directions 

This paper reports on the status of the Benchmark project as of the end of its first
six-month phase. Development of the Procedure has continued so that other major categories
of linguistic phenomena have been included in the Procedure and Classification Scheme. We
will soon be at another assessment milestone, assessing the extended and revised Procedure
with  two  out  of  three  systems  being  new  to  the  project,  and  a  new  additional  application 
domain. 

Possible  future  directions  for  this  effort  include  continuing  to  extend  the  coverage  of  the 
Evaluation  Procedure,  since  it  will  be  impossible  to  cover  all  relevant  linguistic  phenomena 
in  this  Project.  Other  areas  to  which  the  Procedure  could  be  extended  include  knowledge 
acquisition and the handling of ill-formed input. The Procedure could also be improved by
providing the evaluators with automated support for some of the activities performed during
the application of the Procedure and the generation of systems’ Profiles.


9.  Summary 

This  paper  has  discussed  the  Benchmark  Investigation/Identification  Project,  the  Evaluation 
Procedure  being  developed  as  part  of  the  project,  and  the  results  of  the  first  six-month  phase 
of  the  project.  Important  features  of  the  Benchmark  Procedure  include  the  fact  that  it  is 
being  developed  to  produce  comprehensive  profiles  of  NLP  systems  that: 

•  provide  descriptive  information  regarding  the  linguistic  phenomena  being  tested, 

•  are  hierarchically  organized  according  to  a  classification  scheme  that  provides  various 
levels  of  granularity  for  profile  display, 

•  provide  quantitative  information  based  on  the  scores  that  are  assigned  by  the  evaluators 
to  individual  test  items;  the  scores  are  aggregated  by  class  and  weighted  averages 
calculated  for  each  class  in  the  hierarchy,  and 

•  are  objective  in  that  test  items  are  defined  in  a  detailed  manner  so  as  to  remove  the 
subjectivity  of  the  evaluator. 

In response to difficulties that have been recognized in other evaluation methods, the
Benchmark Procedure is being designed with certain important attributes:

•  The  Procedure  is  being  developed  to  be  usable  with  different  application  domains. 
The  great  value  in  this  feature  is  that  the  system  developers  are  saved  the  (possibly 
considerable)  expense  of  re-engineering  or  porting  the  system  to  a  new  domain  in  order 
to  be  tested.  This  feature  is  unique  to  the  Benchmark  Evaluation  approach. 

•  The Procedure is being developed to be usable with different types of NLP systems such
as  database  query  NL  front-end  systems,  text/message  processing  systems,  interactive 
NL  dialogue  interfaces,  etc.  Since  these  classes  of  NLP  systems  are  not  mutually 
exclusive  and  some  of  the  different  classes  have  linguistic  capabilities  in  common,  the 
idea  is  to  have  a  test  that  can  provide  comprehensive  profiles  of  these  different  types 
of  systems.  This  is  another  unique  feature  of  the  Benchmark  Evaluation  approach. 

•  The  Procedure  is  repeatable  and  produces  consistent  results,  independent  of  evaluator. 

•  The Procedure does not require that the evaluator be a trained linguist. The Procedure
items include instructions, explanatory material, and examples that enable the
evaluator to create, modify, or tailor test sentences for a particular NLP system.


The scoring method provides the evaluator with four choices when scoring a system’s
performance: Success, Failure, Indeterminate, and Unable to compose input. Indeterminate means
that the evaluator was unable to determine whether the system understood the NL input,
even after several different attempts to test for the particular capability. An evaluator may
be unable to compose input if the language of the NLP system is so restricted that the needed
words or phrase types are not available.

As  part  of  the  Benchmark  Project,  the  Evaluation  Procedure  is  being  assessed  to  determine 
whether  the  Procedure  meets  the  objectives  of  the  project.  The  assessment  task  provides  for 
the  Procedure  to  be  applied  to  three  different  NLP  systems  by  each  of  three  evaluators  at 
the end of each of the three six-month phases of the project. The results of the first phase
assessment  seemed  to  indicate  that  the  Procedure,  as  developed  so  far,  does  indeed  meet  the 
objectives  listed  above. 

The  Benchmark  Evaluation  Procedure  should  prove  to  be  a  comprehensive  evaluation  method 
and  tool  that  is  useful  for  assessing  the  development  of  individual  systems,  for  comparing 
different  systems,  and  for  matching  NLP  systems  with  the  requirements  of  application  tasks 
and systems. We welcome feedback from members of the computational linguistics community.
As the project moves from Phase I into Phase II, the Procedure continues to be
extended  and  improved. 


APPENDIX  A.  Excerpts  from  the  Benchmark  I/I  Procedure 

NOTE: Rows of asterisks indicate where Procedure material has been omitted.

5.  Noun  Phrase  Postmodification 

5.1  Relative  Clauses 

A relative clause is a sentence that is embedded in the postmodification position of a noun phrase.
A full relative clause consists of a relative pronoun followed by a sentence or verb phrase with some
omitted constituent(s). For example, the sentence

“The  plane  [that  we  saw]  was  a  DC- 10” 

includes the relative clause “that we saw” in brackets. This relative clause consists of the relative
pronoun  “that”  followed  by  the  sentence  “we  saw  (the  plane)”  where  “the  plane”  has  been  omitted. 

Relative  pronouns  have  the  double  role  of  referring  to  the  antecedent  (the  head  of  the  noun  phrase 
being  modified)  and  of  functioning  as  a  constituent  in  the  relative  clause  (e.g.,  the  omitted  object 
the  plane  in  the  above  relative  clause). 


5.1.1  Relative  Pronoun  as  Subject 

The  structure  of  this  type  of  relative  clause  is: 

[Rel-Pronoun]  [VP] 

Eg,  [that]  [hired  Hary  Smith] 

Eg,  [who]  [joined  the  C.S.  Department] 

In  the  next  three  test  items,  you  will  use  this  type  of  relative  clause  in  the  postmodification  position 
of a noun phrase, first with the pronoun THAT, then with a personal pronoun, then with a
non-personal pronoun.

5.1.1.1  The Relative Pronoun THAT - Restrictive Only

Eg,  Is  John  Smith  the  person  [that  is  V.P.  of  Finance]  ? 

Eg,  Who  is  the  person  [that  is  V.P.  of  Finance]  ? 

Eg, List the person [that is V.P. of Finance].


Score: 


5.1.2  Relative  Pronoun  as  Object 

The structure of this type of relative clause is:

[Rel-Pronoun] [NP] [Verb]

Eg, [that] [James Harris] [hired]

Eg, [who] [the Chairman] [promoted]

In  the  next  two  test  items,  you  will  use  this  type  of  relative  clause  in  the  postmodification  position 
of  a  noun  phrase. 

5.1.2.1  The Relative Pronoun THAT as Object - Restrictive Only

Eg, Is John Smith the person [that the sales department promoted] ?

Eg,  Who  is  the  person  [that  Mark  Watson  hired]  ? 


Score: 


5.1.4  Relative  Pronoun  Prepositional  Object 

5.1.4.1  Preposition First in the Relative Clause

The structure of this type of relative clause is:

[Preposition] [Rel-Pronoun] [NP] [VP]

Eg,  [for]  [whom]  [Jim  Davis]  [works] 


5.1.4.1.1  Preposition First with Personal Pronoun

Eg, Is John Smith a person [for whom you have an address] ?
Eg,  List  the  employees  [for  whom  John  Smith  is  supervisor]. 


Score: 


Evaluating Syntax Performance of Parser/Grammars of English


Philip  Harrison,  Boeing  Computer  Services 

Steven  Abney,  Bellcore 

Ezra  Black,  IBM 

Dan  Flickinger,  Hewlett  Packard 

Claudia  Gdaniec,  Logos,  Inc. 

Ralph  Grishman,  NYU 

Donald  Hindle,  AT&T 

Robert Ingria, BBN

Mitch  Marcus,  U.  of  Pennsylvania 

Beatrice  Santorini,  U.  of  Pennsylvania 

Tomek  Strzalkowski,  NYU 

We report on an ongoing collaborative effort to develop criteria, methods, measures
and  procedures  for  evaluating  the  syntax  performance  of  different  broad-coverage 
parser/grammars  of  English.  The  project  was  motivated  by  the  apparent  difficulty  of 
comparing  different  grammars  because  of  divergences  in  the  way  they  handle  various 
syntactic  phenomena.  The  availability  of  a  means  for  useful  comparison  would  allow 
hand-bracketed  corpora,  such  as  the  University  of  Pennsylvania  Treebank,  to  serve  as  a 
source  of  data  for  evaluation  of  many  grammars.  The  project  has  progressed  to  the 
point where the first version of an automated syntax evaluation program has been completed
and is available for testing. The methodology continues to undergo refinement
as  more  data  is  examined. 

The  project  began  with  a  comparison  of  hand  syntactic  analyses  of  50  Brown  Corpus 
sentences  by  grammarians  from  nine  organizations:  Steve  Abney  (Bellcore),  Ezra 
Black  (IBM),  Dan  Flickinger  (Hewlett  Packard),  Claudia  Gdaniec  (Logos),  Ralph 
Grishman and Tomek Strzalkowski (NYU), Philip Harrison (Boeing), Donald Hindle
(AT&T),  Robert  Ingria  (BBN),  and  Mitch  Marcus  and  Beatrice  Santorini  (U.  of 
Pennsylvania).  The  purpose  of  the  bracketing  exercise  was  to  provide  a  focus  for  the 
discussion  of  syntactic  differences  and  a  source  of  data  to  test  proposals  for  evaluation 
techniques.  The  participating  grammarians  produced  labelled  bracketings  representing 
what they ideally want their grammars to specify. After the exercise was completed, a
small  workshop  was  held  at  the  University  of  Pennsylvania  to  discuss  the  results  and 
examine  proposals  for  evaluation  methodologies. 

The results of the hand-bracketing exercise revealed that very little structure was common
to all the parses. For example, an analysis revealed that the following three
Brown  Corpus  sentences  (taken  from  what  we  call  the  "consensus"  parses)  display 
only  the  indicated  phrases  in  common  to  all  of  the  bracketings: 

The  famed  Yankee  Clipper,  now  retired,  has  been  assisting  (as  (a  batting  coach)). 

One of those capital-gains ventures, in fact, has saddled him (with (Gore Court)).

He said this constituted a (very serious) misuse (of the (Criminal court) processes).

A rather more encouraging result was obtained when phrases were selected which appeared
bracketed in a majority of the analyses (the "majority" parses):

((((The  famed  (Yankee  Clipper)) ,  (now  retired) ,) 

(has  been  (assisting  (as  (a  batting  coach))))) .) 

(((One  (of  (those  capital-gains  ventures))) ,  (in  fact)  , 

(has (saddled him (with (Gore Court))))) .)

((He  (said  (this  (constituted  (a  (very  serious)  misuse 

(of (the (Criminal court) processes))))))) .)

The lack of structure for the consensus parses is a reflection of the diversity of approaches
to such phenomena as punctuation, the employment of null nodes by the
grammar,  and  the  attachment  of  auxiliaries,  negation,  pre-infinitival  ‘to’,  adverbs,  and 
other  types  of  constituents.  But  the  results  for  the  majority  parses  indicated  that  a 
good  foundation  of  agreement  exists  among  the  several  grammars. 
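
The difference between the "consensus" and "majority" parses can be sketched by counting how many analyses contain each bracket. This is a minimal illustration, assuming each analysis has already been converted to a set of (start, end) word spans; the three small analyses below are invented for the example and are not the grammarians' data.

```python
from collections import Counter


def common_spans(analyses, threshold):
    """Return the spans that occur in at least `threshold` of the analyses."""
    counts = Counter(span for spans in analyses for span in set(spans))
    return {span for span, n in counts.items() if n >= threshold}


# Three invented analyses of a four-word string, each a set of (start, end) spans.
analyses = [
    {(0, 4), (0, 2), (2, 4)},
    {(0, 4), (0, 2), (1, 4)},
    {(0, 4), (0, 2), (2, 4)},
]

consensus = common_spans(analyses, len(analyses))           # bracketed by every analysis
majority = common_spans(analyses, len(analyses) // 2 + 1)   # bracketed by more than half
print(sorted(consensus), sorted(majority))
```

A single dissenting analysis is enough to remove a bracket from the consensus set, which is why the consensus parses retain so little structure compared with the majority parses.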

The  challenge  was  to  find  an  evaluation  method  that  would  not  penalize  even  those 
analyses  that  diverged  from  the  majority  in  ways  that  would  be  considered  generally 
acceptable.  The  proposed  solution,  explored  in  depth  by  hand  analysis  at  the 
workshop,  involves  1)  the  systematic  elimination  of  certain  problematical  constructions 
from the parse tree (resulting in trees that show a much higher degree of structural
agreement) and 2) systematic restructuring of constituents to a minor degree for particular
constructions if the grammar being evaluated differs from the evaluation standard
for  these  constructions.  The  evaluation  program  itself  carries  out  the  elimination  of 
constituents  for  both  the  standard  parse  and  the  parse  being  tested  (hereafter  the  test  or 
candidate  parse,  provided  by  the  client  grammarian  for  evaluation).  The  client  is 
responsible for restructuring the special constructions in the test parse. These restructurings
will be discussed after the evaluation procedure itself.

The proposed evaluation procedure has been implemented and is still undergoing
analysis and modification, but generally, it has these characteristics: it judges a parse
based only on the constituent boundaries it stipulates (and not the categories or
features that may be assigned to these constituents); it compares the parse to a
hand-parse of the same sentence from the University of Pennsylvania Treebank (the standard
parse); and it yields two principal measures for each parse submitted: Crossing
Parentheses and Recall.

The procedure has three steps. For each parse to be evaluated:

(1) erase all word-external punctuation and null categories from
both the standard tree and the test tree; use the standard tree
to identify and erase from both trees all instances of: auxiliaries,
"not", pre-infinitival "to", and possessive endings (’s and ’).

(2) recursively eliminate from both trees all parenthesis pairs
enclosing either a single constituent or word, or nothing at all;

(3) using the nodes that remain, compute goodness scores
(Crossing Parentheses, and Recall) for the input parse,
by comparing its nodes to a similarly-reduced node set
for the standard parse.

For example, for the Brown Corpus sentence:

Miss Xydis was best when she did not need to be too probing.

consider  the  candidate  parse: 

(S (NP-s (PNP (PNP Miss) (PNP Xydis)))
   (VP (VPAST was) (ADJP (ADJ best)))
   (S (COMP (WHADVP (WHADV when)))
      (NP-s (PRO she))
      (VP (X (VPAST did) (NEG not) (V need))
          (VP (X (X to) (V be)) (ADJP (ADV too) (ADJ probing)))))
   (? (FIN .)))

After  step-one  erasures,  this  becomes: 

(S (NP-s (PNP (PNP Miss) (PNP Xydis)))
   (VP (VPAST was) (ADJP (ADJ best)))
   (S (COMP (WHADVP (WHADV when)))
      (NP-s (PRO she))
      (VP (X (VPAST ) (NEG ) (V need))
          (VP (X (X ) (V be)) (ADJP (ADV too) (ADJ probing)))))
   (? (FIN )))

And after step-two erasures:

(S (NP-s Miss Xydis) (VP was best)
   (S when she (VP need (VP be (ADJP too probing)))))
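
Steps one and two can be sketched on the example above. This is an illustrative sketch, not the evaluation program itself: the function names are hypothetical, the words to erase are passed in as a plain set (an assumption that only works here because every occurrence of those words in this sentence is to be erased, whereas the procedure identifies them from the standard tree), and both steps are carried out in a single bottom-up pass.

```python
def parse(text):
    """Read a labelled bracketing such as "(S (NP a b) c)" into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        node = [tokens[i + 1]]          # tokens[i] is "(", tokens[i + 1] is the label
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
                node.append(child)
            else:
                node.append(tokens[i])  # a word
                i += 1
        return node, i + 1

    return read(0)[0]


def reduce_tree(node, erased):
    """Apply steps one and two; returns a list holding zero or one item."""
    if isinstance(node, str):
        return [] if node in erased else [node]        # step one: erase the word
    kids = [item for child in node[1:] for item in reduce_tree(child, erased)]
    if len(kids) == 1 and isinstance(kids[0], list):
        return [[node[0]] + kids[0][1:]]               # one constituent inside: merge the pair
    if len(kids) <= 1:
        return kids                                    # one word or nothing inside: drop the pair
    return [[node[0]] + kids]


def show(node):
    """Write a nested-list tree back out as a labelled bracketing."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [show(child) for child in node[1:]]) + ")"


candidate = parse(
    "(S (NP-s (PNP (PNP Miss) (PNP Xydis))) (VP (VPAST was) (ADJP (ADJ best))) "
    "(S (COMP (WHADVP (WHADV when))) (NP-s (PRO she)) "
    "(VP (X (VPAST did) (NEG not) (V need)) "
    "(VP (X (X to) (V be)) (ADJP (ADV too) (ADJ probing))))) (? (FIN .)))"
)
reduced = reduce_tree(candidate, erased={"did", "not", "to", "."})[0]
print(show(reduced))
```

When a bracket pair encloses a single constituent, the sketch keeps the outer label; since the measures ignore category labels, either label could be kept without affecting the scores.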

The University of Pennsylvania Treebank output for this sentence, after steps one and
two have been applied to it, is:

(S (S (NP Miss Xydis) (VP was best))
   (SBAR when (S she (VP need (VP be (ADJP too probing))))))

Step three consists of comparing the candidate parse to the Treebank parse and deriving
two scores: (1) The Crossing Parentheses score is the number of times the candidate
parse has a structure such as ((A B) C) and the standard parse has one or more
structures such as (A (B C)) which "cross" with the test parse structure. (2) The Recall
score is the number of parenthesis pairs in the intersection of the candidate and
treebank parses (T intersection C) divided by the number of parenthesis pairs in the
treebank parse T, viz. (T intersection C) / T. This score provides an additional measure
of the degree of fit between the standard and the candidate parses; in theory a Recall
of 1 certifies a candidate parse as including all constituent boundaries that are considered
essential to the analysis of the input sentence by the Treebank. (Treebank
parses are in general underspecified because certain structures, such as compound
nouns, are not bracketed.) For the above example sentence, there are no crossings and
the recall is 7/9.
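
The two scores can be sketched by turning each reduced bracketing into a set of word spans. This is an illustrative sketch rather than the evaluation program itself: the function names are hypothetical, it assumes every parenthesis pair carries a category label (which is then ignored), and it counts each candidate constituent that crosses at least one standard constituent as one crossing.

```python
from fractions import Fraction


def spans(bracketing):
    """Return the set of (start, end) word spans of a labelled bracketing."""
    tokens = bracketing.replace("(", " ( ").replace(")", " ) ").split()
    found, starts, position, i = set(), [], 0, 0
    while i < len(tokens):
        if tokens[i] == "(":
            starts.append(position)
            i += 1                      # skip the category label that follows "("
        elif tokens[i] == ")":
            found.add((starts.pop(), position))
        else:
            position += 1               # a word
        i += 1
    return found


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def score(candidate, standard):
    """Return (crossing count, recall) for a candidate against the standard parse."""
    c, t = spans(candidate), spans(standard)
    crossings = sum(any(crosses(x, y) for y in t) for x in c)
    recall = Fraction(len(c & t), len(t))
    return crossings, recall


candidate = "(S (NP-s Miss Xydis) (VP was best) (S when she (VP need (VP be (ADJP too probing)))))"
standard = ("(S (S (NP Miss Xydis) (VP was best)) "
            "(SBAR when (S she (VP need (VP be (ADJP too probing))))))")
print(score(candidate, standard))
```

For the example sentence the sketch gives no crossings and a recall of 7/9: the two standard constituents missing from the candidate are the inner S spanning "Miss Xydis was best" and the S spanning "she need be too probing".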

The last element of the proposed evaluation method involves the restructuring of trees
by the client, which is necessary only if the parse submitted treats any of certain constructions
in a manner different from the standard. At the workshop, three constructions
were identified: extraposition, modification of noun phrases by post-head phrases
such as PP, and sequences of prepositions which occur constituent-initially and/or particles
which occur constituent-finally. Briefly, for extraposition sentences like "It is
necessary for us to leave", the extraposed phrase "for us to leave" should be attached at the
S level and not, for example, as a sister of "necessary". For NP modification, post-head
modifiers should be attached to the NP and not at the N-BAR level. Finally, for sequences
of prepositions/particles we attach to the top node of the constituent. Thus if
the initial client analysis is

(We (were ((out of) (oatmeal cookies))))

then the restructured analysis should be

(We (were (out of (oatmeal cookies)))).

These three constructions were identified from a hand analysis of a limited amount of
data and we are currently examining more data to see whether the list should be extended.

Generally, there are two strategies that can be followed in cases where a client’s
analysis differs systematically from the standard: modify the evaluation program so
that it deletes certain nodes, or specify a procedure that can be adopted by clients to
bring their trees into conformity with the standard. However, we have seen that there
are instances where reconciliation is very difficult or impossible and are working to assess
the expected frequency of such cases.

Before the evaluation software was available, we applied the method by hand, using
the UPenn Treebank as a standard, to 14 of the above-mentioned 50 Brown Corpus
sentences which were given their "ideal" analyses by the grammarians. (Canonical
modifications as specified above were required.) The sentences were selected because
they had been successfully run by one of our automated systems (NYU’s) and were
expected to give some hint of the method’s reliability for sentences that are easy for
automated systems. The Crossing score was zero in every case and the corresponding
Recall average score was 94%. We were encouraged by this initial result to pursue
the development of software to carry out the scoring.

After the evaluation program became available, we ran it on the entire 50 sentence
corpus and obtained the following results:


            crossings      recall

AT&T         3   (1)        .88
BBN          4   (1)        .86
Bellcore    10   (5)        .87
Boeing       4   (1)        .97
HP           4   (0)        .97
IBM          4   (2)        .96
Logos        3   (0)        .86
NYU         10  (10)        .79

The first number in the crossings column is the total number of sentences that
contained a crossing while the second number in parentheses is the number of sentences
with crossings that remain after certain policy changes are implemented in the standard
parse and the node deletion protocol of the evaluation procedure.

There are several points to be made about the above data. First, we feel that the number
of crossings initially obtained is unacceptably high and that changes in the standard
bracketing procedures or changes in the deletion protocols need to be adopted.
Second, the number of crossings obtained after a few suggested changes are
implemented (the number in parentheses) is an acceptable level of crossings for a 50
sentence corpus for all but two of the grammars. However, until more data are examined,
we will not know whether this level of crossings can be maintained with a fixed
evaluation method. We are still in a "training phase" as far as the bracketing and
deletion policies go, and the actual level that will be attained may turn out to be less
than is acceptable. The policy changes themselves are still being debated by the group.
Finally, we note that two of the grammars (Bellcore’s and NYU’s) differ significantly
from the others with respect to crossings. The Bellcore grammar is based on a new
grammar methodology called "chunking", which results in non-standard phrasal
groupings in some instances, while the NYU grammar is significantly different in that it
does not use any category corresponding to verb phrase, which results in non-standard
attachments. It is unclear at this time whether convenient transformations can be
found to allow these grammars to be compared to the standard so as to reduce their
crossings scores.

There  are  four  proposed  changes  to  the  evaluation  method  and  the  standard  that  are 
being  debated  at  this  time  by  our  group.  If  the  four  policies  below  are  adopted,  then 
the  crossing  scores  obtained  are  the  ones  in  parentheses  in  the  above  table.  The  four 
policies  are: 

1) Delete left-recursive subnodes of type S from the standard. The Treebank uses
recursive attachment at the S level for adverbial attachment in sentences like
Miss Xydis was best when she did not need to be too probing
which results in a structure of the form (S (S (A ...)(B ...))(C ...)). Several of us
preferred to attach the rightmost constituent (the ‘when’ phrase) at a lower level. With a
structure of the form (S (A ...)(C ...)) [...] of the crossings are eliminated from our
data. This policy can be implemented by the evaluation program.
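
One reading of policy 1 is that an S node which is the leftmost child of another S is
removed and its children are attached directly to the outer S. The sketch below
illustrates that reading only. The (label, children) tuple representation and the
function name are assumptions made for this example, not a description of the actual
evaluation program.

```python
def delete_left_recursive_s(node):
    """Remove an S that is the leftmost child of an S, promoting its children."""
    if isinstance(node, str):  # a word
        return node
    label, children = node
    children = [delete_left_recursive_s(child) for child in children]
    while (
        label == "S"
        and children
        and not isinstance(children[0], str)
        and children[0][0] == "S"
    ):
        # Splice the inner S's children in place of the inner S node itself.
        children = list(children[0][1]) + children[1:]
    return (label, children)


standard = ("S", [("S", [("A", ["a"]), ("B", ["b"])]), ("C", ["c"])])
print(delete_left_recursive_s(standard))
```

After the deletion, the outer S dominates A, B and C directly, so a candidate parse that
groups B with C lower in the tree no longer crosses a standard constituent.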

2) Flatten structures in the standard containing the collocations less than, more than,
greater than, etc. when they precede a number or adjective. Some of us take these
collocations as constituents (under a certain reading of the sentence) while others always
build phrases with than and phrases to its right before combining with less, more, etc.
The lack of agreement among practitioners can be accommodated only if the standard
is neutral. So the phrase more than 400,000,000 inhabitants would need to be
bracketed something like

(NP (ADVP more than 400,000,000) inhabitants).

The  same  requirement  would  also  be  imposed  for  phrases  such  as  more  than  likely. 

3) Flatten certain common sequences involving preposition, noun, preposition such as
in light of and in violation of. Here again, there is a diversity of practice in our group
as to whether the preposition, noun, preposition sequence is treated as a multi-word
preposition or has NP and PP structures built between the words, as exemplified by:

(PP  in  (NP  light  (PP  of  (NP  his  success)))) 

A  neutral  bracketing  of  this  phrase  is 

(PP  in  light  of  (NP  his  success)) 

4)  Delete  copular  be  when  it  precedes  an  adjective.  A  phrase  such  as  is  happy  to 
leave  would  receive  both  of  the  following  bracketings  in  our  data:  ((is  happy)(to 
leave))  and  (is  (happy  (to  leave))).  The  deletion  policy  will  eliminate  any  crossings 
for  this  type  of  phrase. 
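
The effect of policy 4 can be checked on the two bracketings just given. The sketch below
assumes nested Python lists as the parse representation and, as a simplification, drops
the copula by word form alone rather than testing whether it precedes an adjective. All
names are illustrative.

```python
def spans(tree, drop=frozenset()):
    """Return the (start, end) word spans of a nested-list parse, skipping dropped words."""
    found = set()

    def walk(node, start):
        if isinstance(node, str):
            return start if node in drop else start + 1
        end = start
        for child in node:
            end = walk(child, end)
        if end > start:  # ignore constituents left empty by the deletion
            found.add((start, end))
        return end

    walk(tree, 0)
    return found


def has_crossing(a, b):
    """True if some span in a overlaps a span in b without nesting."""
    return any(
        x[0] < y[0] < x[1] < y[1] or y[0] < x[0] < y[1] < x[1]
        for x in a
        for y in b
    )


first = [["is", "happy"], ["to", "leave"]]
second = ["is", ["happy", ["to", "leave"]]]
print(has_crossing(spans(first), spans(second)))
print(has_crossing(spans(first, {"is"}), spans(second, {"is"})))
```

With the copula in place, (is happy) crosses (happy (to leave)); once is has been deleted
from both parses, the remaining spans nest and no crossing is reported.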

Even with these additional policies, there is still a residual set of eight sentences with
crossings for some of the grammars (excluding, for the sake of brevity, some sentences
for which the NYU grammar has crossings). We present here the eight sentences
along with a discussion of the differences in analysis that led to the crossings:

1.  The  petition  listed  the  mayor’s  occupation  as  attorney  and  his  age  as  71. 

The standard analyzes this by coordinating listed ... as attorney with his age as 71.
(The second coordinate is taken to be a verb phrase with an ellipted verb.) One of us
prefers an analysis in which the mayor’s occupation as attorney and his age as 71 are
treated as the coordinated constituents, creating a phrase crossing with the first
coordinate phrase of the standard.

2.  His  political  career  goes  back  to  his  election  to  city  council  in  1923. 

The standard analysis makes a constituent out of back to ... 1923 while one of our
analyses postulates goes back as a constituent.

3.  All  Dallas  members  voted  with  Roberts,  except  Rep.  Bill  Jones,  who  was  absent. 

The standard attaches non-restrictive relative clauses to NP. In this case who was
absent is attached to Rep. Bill Jones. Two of us attach non-restrictive relative clauses
at the sentential level.

4.  The  odds  favor  a  special  session,  more  than  likely  early  in  the  year. 

The standard attaches more than likely early in the year to the NP associated with
session while some of us attach it higher.

5. The year will probably start out with segregation still the most troublesome issue.

One of our grammars attaches the adverb probably at a low level to the verb start
while the standard associates it with the S and specifies a verb phrase from start to
the end of the sentence, which produces a crossing.

6. The dinner is sponsored by organized labor and is scheduled for 7 p.m.

The  standard  coordinates  is  sponsored  by  organized  labor  and  is  scheduled  for  7  p.m. 
while  another  analysis  coordinates  the  dinner  is  sponsored  by  organized  labor  with  is 
scheduled  for  7  p.m. 

7.  He  is  willing  to  sell  it  just  to  get  it  off  his  hands. 

There is significant disagreement in our group over how to attach the phrase just to get
it off his hands. The standard attaches it under the root S, while others attach it
variously to phrases beginning with is willing, willing, and sell. (A recursive attachment
to is willing to sell it would not produce a crossing with the standard.)

8. Mr. Reama, far from really being retired, is engaged in industrial relations
counseling.

The standard takes far as an adverb that subcategorizes a PP, while one of our
grammars treats far from as a multi-word lexical item.

In conclusion, we believe that the degree of disagreement that remains after the
application of our deletion and restructuring method does not pose a significant barrier to
the use of hand bracketed corpora for evaluation purposes for most of our grammars.
However, the amount of data that we have been able to examine so far is limited and
our judgements about the success of the method are still tentative. We will continue
with our hand analyses, but also start to use the evaluation program with the real
output of our parsers in a realistic test of the complete evaluation methodology. We
invite other groups to participate and will make our evaluation software (which runs in
Common Lisp) available.


1 Introduction

This paper describes an effort to construct a catalogue of syntactic data which
is intended eventually to exemplify the major syntactic patterns of the German
language. Our purpose in developing the catalogue and related facilities is to
obtain an empirical basis for diagnosing errors in natural language processing
systems analyzing German syntax, but the catalogue may also be of interest to
theoretical syntacticians and to researchers in speech and related areas. The data
collection differs from most related enterprises in two respects: (i) the material
consists of systematically and artificially constructed sentences rather than
naturally occurring text, and (ii) the material is annotated with information about the
syntactic phenomena illustrated, which goes beyond tagging parts of speech. The
catalogue currently treats verb government (including reflexive verbs and verbal
prefixation) and coordination.

The data consists of linguistic expressions (mostly short sentences designed to
exemplify one syntactic phenomenon) together with annotations describing selected
syntactic properties of the expression. The annotations of the linguistic material
serve (i) to classify construction types in order to allow selected systematic testing
of specific areas of syntax, e.g., coordination; and (ii) to provide a linguistic
knowledge base supporting the research and development of natural language processing
(NLP) systems. Besides classificatory information, the annotations contain
information about the precise structure of the sentence such as the position of the
finite verb and the positions of other phrases.

* This work was undertaken with financial support from the German Ministry for Research
and Technology (BMFT) and the LILOG project of the IBM Corporation.

In order to probe the accuracy of NLP systems, especially the detection of
unwanted overgeneration, the test material includes not only genuine sentences, but
also some syntactically ill-formed strings.

The syntactic material, together with its annotations, is being organized into a
relational database in order to ease access, maintain consistency, and allow
variable logical views of the data. The database system is in the public domain and
is (mostly) independently supported.

Our intent is to make this work public, both the test material and the database of
annotations. We plan to share this work first with selected contributing partners,
and later with the general research and development community.


2  Goals  of  a  Diagnostics  Tool 

Our goal in collecting and annotating syntactic material is to develop a diagnostic
tool for natural language processing systems, but we believe the material may be
of interest to other researchers in natural language, particularly syntactic
theoreticians. Finally, although this is not an evaluation tool by itself, our work points to
possibilities for evaluating systems of syntactic analysis by allowing the systematic
verification of claims about, and investigation of, the coverage and precision of
systems.


2.1  Natural  Language  Processing 

There is general consensus, both in theoretical computational linguistics and in
practical, industrially sponsored research in natural language processing, that
systems for syntactic analysis (parsing, recognition and classification) are possible
and valuable. The applications of syntactic analysis currently under investigation
include grammar and style checking; machine translation; natural language
understanding (particularly interfaces to databases, expert systems, and other software
systems); information retrieval; speech synthesis; and speech recognition. The
potential impact of syntactic analysis technology is technically and financially
profound.

But if we are to realize the full benefits of syntactic analysis, then we must ensure
that correct analyses are provided. The development of a diagnostic tool serves
just this purpose: pointing out where analyses are correct, and where incorrect.
There are, of course, other measures of quality which apply to natural language
software, e.g., general software standards. Systems which perform syntactic
analysis are naturally subject to the same general standards of software quality that
are imposed throughout the software engineering field, e.g., efficiency, modularity,
modifiability, compatibility, and ease of installation and maintenance.
Special-purpose systems may be subject to further standards; e.g., interface software is
generally required to have clear and intuitive boundaries (transparency).
Compared to such general software standards, correctness of syntactic analysis is an
orthogonal criterion, though for many applications, an overriding one. Attending
exclusively to general software standards means risking incorrectness, whether
this be incorrectness of matrix multiplication in a linear algebra package or
misanalyses in a natural language parser. The ultimate costs of such misanalysis
depend, of course, on the particular application, but these costs may easily
outweigh the benefits of the system deployed.

The importance of precision in syntactic analysis is occasionally disputed. It is
pointed out, for example, that humans make speech errors (and typos), and that
natural language understanding systems will have to be sufficiently robust to deal
with these. Here, it is claimed, less precise systems may even have an advantage
over more exact, and hence “brittle”, competitors. What is correct about this point
is that systems should be able to deal with ill-formed input. What is questionable
is the suggestion that one deal with it by relaxing syntactic or other constraints
generally (although it might be quite reasonable to use constraint relaxation where
no exact analysis may be found, as a processing strategy).

The problem with general constraint relaxation is that it inevitably involves not
only providing analyses for ill-formed input (as intended), but also providing
additional incorrect analyses for well-formed input: “spurious ambiguity”. To see
this, consider agreement, probably a good candidate for a less important “detail”
of syntax which might safely be ignored. For example, it might be argued that
sentence (1) below ought to be regarded as syntactically acceptable, since it’s clear
enough what’s intended:

(1) Liste alle Sekretärinnen, die einen PC benutzt
List all secretaries who uses a PC

Syntactically tolerant systems would accept this sentence, but they would then
have no way of distinguishing correct and incorrect parses of sentences such as
(2), which are distinguished only by agreement:

(2) Liste jede Sekretärin in Finanzabteilungen, die einen PC benutzt
List every secretary in finance departments who uses a PC

The relative clause die einen PC benutzt can of course only be understood as
modifying Sekretärin (the only NP with which it agrees), but a system which
ignored agreement information would have no way of eliminating the parse in
which the relative clause is construed as modifying Finanzabteilungen.
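
The argument about sentence (2) can be illustrated with a toy attachment filter. The
sketch assumes that each candidate NP carries a hand-assigned number feature and that the
relative clause contributes the number of its finite verb (benutzt is singular). The
class and function names are invented for this example, and real agreement checking would
involve more features than number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NounPhrase:
    head: str
    number: str  # "sg" or "pl", assigned by hand for this illustration


def attachment_sites(candidates, verb_number, check_agreement=True):
    """Return the NPs a relative clause may modify, optionally enforcing number agreement."""
    if not check_agreement:
        return list(candidates)
    return [np for np in candidates if np.number == verb_number]


candidates = [NounPhrase("Sekretärin", "sg"), NounPhrase("Finanzabteilungen", "pl")]
print([np.head for np in attachment_sites(candidates, "sg")])
print([np.head for np in attachment_sites(candidates, "sg", check_agreement=False)])
```

With the agreement check switched on, only Sekretärin survives as an attachment site.
With it switched off, both NPs remain, which is the spurious ambiguity the text warns
about.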

Furthermore, even if we accepted the argument that some applications may ignore
syntactic accuracy, we are still faced with the applications at the other end of the
spectrum of syntactic sensitivity, i.e., applications where syntactic accuracy is
essential. Applications of this sort are found where the microstructure of the text
plays an important role, e.g., grammar or style checking, and generally the entire
area of NL generation: clearly, nobody wants a system which over-generates in
synthesis. Similarly, it is hard to find any advantage for underconstrained systems
in applications such as speech understanding, where the whole point of using
syntactic information is to reduce the number of hypotheses, a goal served only
by maximally constrained systems.

We  therefore  believe  that  syntactic  precision  is  indispensable  for  some  applications 
and  valuable  even  in  applications  in  which  ill-formed  input  may  be  expected. 

The diagnostic tool assesses correctness of syntactic analysis: it supports the
recognition of bugs in the linguistic analysis. This in turn provides both a means
of assessing the effects of proposed changes in syntactic analysis and a means
of tracking progress in system coverage over time. Neither of these derivative tasks
is realistically feasible without the aid of an automated tool. Humans may spot
individual errors when attending propitiously, but we’re poor at systematic checks
and comparisons, especially in large systems created by groups over relatively long
periods of time.


2.2  Linguistic  Research 

This is an appropriate point at which to acknowledge our own debt to descriptive
and theoretical linguistics, from which our primary data (the German sentences
themselves) have been gathered. We expect to reciprocate, i.e., we expect that
descriptive linguistics and even linguistic theory may benefit from the data
collection effort we have undertaken. These benefits may take different forms: first,
we have begun gathering the data in a single place; second, we are organizing it
into a database in a fairly general way, i.e., with relatively little theoretical
prejudice, so that variable perspectives on the data are enabled; third, in addition to
the relatively crude data analysis routinely provided in linguistic data collections,
which seldom extends beyond marking ill-formedness/well-formedness, we have
provided further fundamental data annotations. Fourth, and most intriguingly,
the time may not be distant when linguistic hypotheses may be tested directly on
the computer. Many contemporary computational systems for natural language
syntax are based on ideas of current interest in theoretical linguistics as well, and
there is interest in general machinery for implementing syntactic analysis for wide
varieties of linguistic theories. At that point, the use of diagnostic tools will be of
immediate interest in linguistic research as well.

In sketching these potential benefits of the general data collection and analysis
effort we have begun, it should be clear that we don’t intend to speak only to
linguists employing “corpus-based” methodologies: our information includes facts
about the ill-formedness of strings as well as rudimentary data analysis. This will
become clearer below.


2.3  Toward  Evaluation 

The catalogue of syntactic material we have collated is intended for deployment in
diagnosis, i.e., the recognition and characterization of problems in syntactic analysis.
This is a task different from general system evaluation, which in most cases will
judge the performance of a system relative to the achievement of a goal which
is set by an application. Even if we limit evaluation to the performance of the
syntactic component of a system, there are still some differences which have to
be kept in mind.

The contrast between diagnosis and evaluation can be appreciated if one considers
the case of applying our diagnostic tool to two different systems. In virtually every
case, the result we obtain will show that neither system is perfect (nor perfectly
incorrect), and that neither one analyzes exactly a subset of the constructions of
the other. Suppose, for the sake of illustration, that one system is superior in
treating long-distance (multi-clausal) dependencies, while the other is better at
simple clause structure, but that the performance of the two systems is otherwise
the same. The diagnosis is complete, but the evaluation still needs to determine
the relative importance of the areas in which coverage diverged.¹ If matters were
always as simple as in this illustration, we might appeal to a consensus of informed
opinion, which would in this case certainly regard the treatment of simple clause
structure as more important than that of long-distance dependencies, and would
therefore evaluate the systems accordingly. But matters need not be, and normally
are not, so simple at all. There simply is not a consensus of informed opinion
about the relative importance of various areas of grammatical coverage.

¹ Strictly speaking, this is not necessary; we could evaluate all such cases as equally proficient,
but (i) the results of such "evaluation" would be too coarse to be of much use; and (ii) this simply
goes against good sense. Some areas of grammatical coverage simply are more important than
others. See the example in text, where simple clause structure is certainly more important than
long-distance (multi-clausal) dependency.

Some crucial information that is lacking from our catalogue of syntactic material
is information about relative frequency of occurrence. If this information could
be obtained and added to the database, then it should be possible to develop an
evaluation system of sorts from our diagnosis system.²
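
As a purely hypothetical illustration of how frequency information might turn diagnostic
results into an evaluation score, the sketch below weights per-phenomenon accuracy by
relative frequency. The weighting scheme, the function name, and every number are
invented for this example; the catalogue currently contains no frequency data.

```python
def weighted_coverage(accuracy, frequency):
    """Average per-phenomenon accuracy, weighted by relative frequency of occurrence."""
    total = sum(frequency[phenomenon] for phenomenon in accuracy)
    return sum(accuracy[p] * frequency[p] for p in accuracy) / total


# Invented figures: two systems with mirror-image strengths.
frequency = {"simple clause": 0.9, "long-distance dependency": 0.1}
system_a = {"simple clause": 0.95, "long-distance dependency": 0.60}
system_b = {"simple clause": 0.60, "long-distance dependency": 0.95}

print(weighted_coverage(system_a, frequency))
print(weighted_coverage(system_b, frequency))
```

With these made-up weights the system that is stronger on simple clauses scores higher,
which matches the informed-opinion judgement in the illustration above. A different
frequency profile could reverse the ranking.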


3  The  Diagnosis  Facility 

We include here a brief description of the diagnostic facility; more detailed
documentation, especially for the various areas of coverage of the syntactic catalogue,
is currently under preparation.


3.1  Sentence  Suite 

As noted in the introduction, our material consists of sentences we have carefully
constructed to illustrate syntactic phenomena; we have not attempted to collect
examples from naturally occurring text. Several considerations weighed in favor
of using the artificially constructed data:

• since the aims are error detection, support of system development, and
evaluation of systematic coverage, we need optimal control over the test data.
Clearly, it is easier to construct data than to collect it naturally when we
have to examine (i) a systematic range of phenomena or (ii) very specific
combinations of phenomena.

• we wished to include negative (ill-formedness) data, in order to test more
precisely (cf. discussion in Section 2.1 on “spurious ambiguity” and also on
the needs of generation). Negative data is not available naturally.

• we wished to keep the diagnostic facility small in vocabulary. This is
desirable if we are to diagnose errors in a range of systems. The vocabulary used
in the diagnostic tool must either (i) be found in the system already, or (ii)
be added to it easily. But then the vocabulary must be limited.

• we wished to exploit existing collections of data in descriptive and
theoretical linguistics. These are virtually all constructed examples, not naturally
occurring text.

• data construction in linguistics is analogous to the control in experimental
fields: it allows the testing of maximally precise hypotheses.

² But it is not clear that this is the best way to go about developing an evaluation system. For
example, we are not making any effort to keep some of the material secret, as speech evaluation
systems routinely do in order to prevent a bias toward test material.




We have no objection to including naturally occurring data in the catalogue,
subject to the restrictions above (especially constraining the size of the facility).

The vocabulary for the test suite has been taken from the domain of personnel
management wherever possible. We chose this domain because it is popular in
natural language processing, both as a textbook example and as an industrial test
case. The domain of personnel management would also be useful in case we are
to diagnose errors in semantics as well as syntax (which we are not attempting
to do at present, but which is an interesting prospect for the future). It presents
a reasonably constrained and accessible semantic domain. Where no suitable
vocabulary from the domain of personnel management presented itself, we have
extended the vocabulary in ad hoc ways.

The suite of test sentences is being collated by various contributors, each
specializing in a single area of coverage, e.g., verb government, coordination, or NP
constructions. Because of the range of syntactic material which is eventually to
be included, it is difficult to draw precise guidelines about the sentences.

Still,  several  factors  have  been  borne  in  mind  while  constructing  the  syntactic 
examples. 

•  lexicon  size  (cf.  above) 

• adherence to the following standards: (somewhat) formal, conversational
High German; i.e., we have avoided colloquialisms, literary peculiarities, and
regional dialects.

• selected testing of negative examples. We have tried to keep the catalogue
small, but not at the cost of using great ingenuity to create minimal sets
of testing data, nor at the cost of introducing very unnatural examples into
the test catalogue. We have not rigorously purged superfluous examples.

• minimization of irrelevant ambiguity (bearing in mind that it cannot be fully
eliminated).

• attention to analytical problems. We have attempted to catalogue not only
the constructions, but also the problems known to be difficult in their
analysis.

We do not deceive ourselves about our chances for success with respect to the last
point: our catalogue is doubtlessly incomplete in many respects, but most sorely
in this one. We invite comment and contribution everywhere, but most especially
in further cataloguing the known analytical problems in German syntax.

In stressing our intention to catalogue analytical problems as well as the basic
range of syntactic construction types, we do not intend to suggest that we
intend to gather a collection of “cute examples”. We will gather cute examples,
but these are relatively few in the general catalogue. Our primary goal will be
a coverage of phenomena which is as comprehensive as feasible, even if this
involves the rather tedious compilation of theoretically relatively well-explored and
scientifically “uninteresting” constructions, such as the full paradigms illustrating
determiner-adjective-noun agreement in German or the different types of verbal
subcategorization. From our experience, it is above all the absence of systematic
and comprehensive test-beds which hampers system development, rather than the
lack of ingenious examples (which frustrate all systems in some way or other). Our
goal is thus not primarily to show what systems cannot do, but to support the
extension of what they can do.


3.2  Syntactic  Annotations 

In choosing which annotations about the sentences might be sensible, we have
been guided by two considerations. First, the catalogue will be much more useful
if examples from selected areas can be provided on demand. For example, it would
be useful to be able to ask for examples of coordination involving ditransitive
verbs, as opposed to simply coordination (an area of coverage). This means that
we need to provide annotations about which area of coverage a given sentence (or
ill-formed string) is intended to illustrate. With regard to these annotations, we
have merely attempted to use standard (traditional) linguistic terminology.

Second,  we  can  exploit  some  annotations  to  check  further  on  precision  of  analysis. 
This  is  the  purpose  of  annotations  such  as: 

•  well-formed vs. ill-formed

•  position  of  finite  matrix  verb 

•  position  of  NP’s 

•  position  of  PP’s 

So, in a sentence such as (3), the following database values are encoded:

(3)  Der Student bittet den Manager um den Vertrag.
the student asks the manager for the contract

well-formedness        OK
finite matrix verb     3
position of NP’s       1-2, 4-5, 7-8
position of PP’s       6-8

In selecting these properties as worthy of annotation, we were motivated primarily
by a wish to focus on properties about which there would be little theoretical
dispute, which would be relatively easy to test, and which would still provide a
reasonable reflection of a system’s accuracy.


3.3  An  Example:  Verbal  Government 

One  of  the  phenomena  which  the  data  collection  already  covers  is  the  area  of 
verbal government, i.e., verbal subcategorization frames. The aim was to compile
a  comprehensive  list  of  combinations  of  obligatory  arguments  of  verbs,  forming 
the  basis  of  different  sentence  patterns  in  German.  We  ignore  both  adjuncts  and 
optional  arguments  in  restricting  ourselves  to  obligatory  arguments,  which  can  be 
tested  by  an  operationalizable  criterion,  a  specific  sort  of  right  extraposition: 

(4)  Er hat gegessen, und zwar Bohnen.
he has eaten, namely beans

(5)  *Er hat verzehrt, und zwar Bohnen.
he has consumed, namely beans

(6)  *Er hat das Buch gelegt, und zwar auf den Tisch.
he has put the book, namely on the table

(7)  Er hat Maria geküßt, und zwar auf die Wange.
he has kissed Mary, namely on the cheek

We attempted to find instances of all possible combinations of nominal, prepositional,
sentential, but also adjectival complements.^ Clearly, we could not immediately
cover the entire field in full depth, so that we decided to adopt a breadth-first
strategy, e.g., we ignored the more fine-grained distinctions to be made in the
area of infinitival complementation or expletive complements. The description in
these areas will be elaborated at later stages.

^ Our list is based on collections to be found in the literature, such as [2], [5], [6], [7],
[9] and [12]. We are also grateful to Stefanie Schachtl, Siemens Munich, who provided us with
some of her material.

The result of the collection is a list of about 90 combinations which are exemplified
in about 300 sample sentences.

The sentences illustrate

•  combinations of nominal, prepositional and adjectival arguments, viz.,

-  nominal arguments only:

(8)  Der Manager gibt dem Studenten den Computer.
the manager gives the student the computer

-  nominal and prepositional arguments with semantically empty (9) or non-empty
prepositions (10):

(9)  Der Vorschlag bringt den Studenten auf den Lösungsweg.
the suggestion takes the student to the solution

(10)  Der Manager vermutet den Studenten in dem Saal.
the manager assumes the student in the hall

-  nominal and adjectival (or predicative) complements

(11)  Der Manager wird krank.
the manager becomes ill

•  nominal arguments combined with finite (subordinate) clauses, introduced
by the complementizers daß (12), ob (13) or some wh-element (14):

(12)  Daß der Student kommt, stimmt.
that the student comes, is-correct

(13)  Dem Manager entfällt, ob der Student kommt.
it escapes the manager, whether the student comes

(14)  Der Manager fragt, wer kommt.
the manager asks who comes

•  nominal arguments in combination with infinitival complements, illustrating
bare infinitives (15) and zu-infinitives (16):

(15)  Der Manager hört den Studenten kommen.
the manager hears the student come

(16)  Der Manager behauptet, den Studenten zu kennen.
the manager claims to know the student

•  examples involving some of the combinations above in connection with expletive
or correlative prepositional pronouns or expletive es:

(17)  Der Vorschlag dient dazu, den Plan zu erklären.
the proposal serves (to-it) to explain the plan

(18)  Der Manager achtet darauf, ob der Student kommt.
the manager checks (on-it) whether the student comes

(19)  Es gelingt dem Studenten, zu kommen.
it succeeds to the student, to escape
"The student succeeds in escaping"

(20)  Der Manager hält es für notwendig, zu kommen.
the manager considers it (for) necessary to come

Since we are interested only in verbal government here, we tried to keep as many
other parameters as possible carefully under control: as already mentioned, the
noun phrases in the sample sentences are built from a limited vocabulary. All
noun phrase and prepositional complements have a definite determiner. In the
case of prepositional phrases the fusion of preposition and determiner (in dem
→ im) is avoided. Since German has relatively free word order, the different
complements have to be identified by their case marking in most cases; as a
consequence, morphological ambiguities of case (e.g. between feminine or neuter
nominative and accusative) were excluded. The matrix and subordinate clauses
all have only one verbal head (i.e., they do not have any auxiliary or modal
verbs), whose morphological form is the third person, singular, present, indicative
form if possible. The sentences do not contain any additional irrelevant modifiers,
adjuncts or particles. The word order of the sample sentences is meant to illustrate
the “unmarked” order, although this should not play an important role, since the
complements are uniquely case marked, as mentioned.

Every  combination  of  complements  is  illustrated  by  at  least  one  example.  In 
addition,  each  sentence  is  paired  with  a  set  of  ill-formed  sentences,  which  illustrate 
three  types  of  errors  relevant  for  verbal  government: 

•  an  obligatory  argument  is  missing; 

•  there  is  one  argument  too  many; 

•  one of the arguments has the wrong form.
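
The three error types can be made concrete with a small sketch. It is an illustration only: the frame notation (a list of case labels such as "nom", "dat", "acc"), the function name classify_frame_error and the result labels are assumptions of the sketch, not part of the catalogue, and the sketch assumes at most one error per sentence.

```python
from collections import Counter


def classify_frame_error(expected, observed):
    """Compare the arguments found in a sentence with the verb's obligatory frame.

    Both parameters are lists of labels such as "nom", "dat", "acc".
    Returns "ok", "missing", "superfluous" or "wrong-form".
    """
    exp, obs = Counter(expected), Counter(observed)
    if exp == obs:
        return "ok"
    lacking = exp - obs   # required labels that are not realized
    surplus = obs - exp   # realized labels that are not required
    if lacking and not surplus:
        return "missing"
    if surplus and not lacking:
        return "superfluous"
    # One label too few and another too many: an argument in the wrong form,
    # e.g. an accusative where the frame requires a dative.
    return "wrong-form"


# Frame of "geben" as in (8): nominative, dative, accusative (hypothetical notation).
GEBEN = ["nom", "dat", "acc"]
```

With this frame, an observed list of "nom" and "acc" is classified as "missing", and "nom", "acc", "acc" as "wrong-form".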

The  material  is  organized  in  a  relational  database,  such  that  queries  can  ask 
either  for  a  description  or  classification  of  a  sentence  or  for  sentences  matching 
combinations  of  descriptive  parameters. 

In describing the argument structure of the sentences we chose a vocabulary which
is of course not theory-neutral, but which at least can be expected to meet common
agreement. We tried to avoid theory-specific notions such as subject or direct
object^ and identified the complements on the basis of morphological case marking,
prepositions, complementizers and/or the morphology of the verb. Obviously, this
vocabulary cannot exhaustively characterize the properties of individual complements.
For example, with those few verbs which subcategorize for two accusative
NPs it is quite unlikely that both NPs behave in the same way with respect to
passivization. Similarly, a nominative complement (“subject”) may have different
properties depending on the verb being unaccusative or unergative. However,
we think that distinctions of this kind should be dealt with separately in data sets
on e.g. passivization, ergativity, etc.


3.4  Database


3.4.1  Abstract  Data  Model 

In  addition  to  the  relatively  straightforward  properties  of  sentences  noted  above 
(Section  3.2),  we  also  model  the  more  complex  classificatory  information  in  the 
catalogue. 

According to the Entity-Relationship (ER) terminology (cf. (S)), we can identify
two entity types and one relationship type which are specified as follows:

1.  Sentence, an entity type, the major concept of the data model. An entity
of this type includes a description of the main verb’s valency (i.e., the
number of arguments the main matrix verb governs and their description),
a sentence which exemplifies the given properties, and information on its
wellformedness. Each entity has a unique identifier, a key attribute which
facilitates queries for description or classification of a sentence. (Given the
present limited range of data and the underlying area (verb government), the
attributes argument-description and fin-matrix-verb could almost be used to
identify a sentence entity uniquely, because there is only one representative
from most valency types in SENTENCE. But some types are represented
more than once.)

2.  Category, an entity type. Each entity of this type (e.g., NP, finite-matrix-verb)
represents a category which appears in a related sentence.

3.  Appears_In, an M:N relationship type^ between CATEGORY and SENTENCE.
Both Category and Sentence participations in the relation are total.
Appears_In has additional attributes specifying the position of a given
category in a related sentence and its lexical form.

The following figure illustrates the conceptual model of the database described
above. It covers the area of verbal government and can be easily extended.


^ M:N relation (many-to-many relation): a sentence entity may be related to (i.e. may include)
numerous category entities, and a category entity may appear in numerous sentence entities.

Figure 1: The ER schema diagram for the database described above.


The following example shows database entries for a given sentence.

(21)  Der Vorschlag bringt den Studenten darauf, daß der Plan falsch ist.
the suggestion takes the student to-it, that the plan is wrong

SENTENCE

s-id   arg-description       f-m-v     ex     sl   na   wf   err   com
1012   nom_acc_cor_sc-dass   bringen   (s1)   11   4    1    0

(s-id = sentence-id, f-m-v = fin-matrix-verb, ex = example, sl = sentence length,
na = number of arguments, wf = wellformedness, err = error code, com = comment)


CATEGORY

category-description   comment
cor                    correlate
fin-matrix-verb
NP
sc-comp                subordinate clause


APPEARS_IN

sentence-id   category-description   pos-from   pos-to   substring
1012          cor                    6          6        darauf
1012          fin-matrix-verb        3          3        bringt
1012          NP                     1          2        der Vorschlag
1012          NP                     4          5        den Studenten
1012          sc-comp                7          11       daß der Plan falsch ist.


A new database entry for a given sentence must include values for the attributes
arg-description, fin-matrix-verb, example, wellformedness, category-description,
pos-from, and pos-to. For ill-formed sentences the error code and additional comments
should be given. All other attributes can be inserted through some triggers
including consistency checks. Splitting the position attribute into pos-from and
pos-to makes the generation of the corresponding substrings possible and facilitates
a consistency check (e.g., pos-from must be a positive integer no greater than
pos-to, and pos-to must be less than or equal to the sentence
length).
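
The substring generation and the consistency check can be sketched as follows. This is a minimal Python illustration, not the awk code of the database: the function names are hypothetical, words are assumed to be separated by blanks, and pos-from is allowed to equal pos-to because the example entries contain one-word categories such as the finite verb at 3-3.

```python
def substring(sentence, pos_from, pos_to):
    """Return words pos_from..pos_to (1-based, inclusive) of a sentence.

    Words are split on whitespace, so punctuation stays attached to its word.
    """
    words = sentence.split()
    return " ".join(words[pos_from - 1:pos_to])


def check_positions(sentence, pos_from, pos_to):
    """Return the list of violated constraints (empty if the entry is consistent)."""
    length = len(sentence.split())
    problems = []
    if pos_from < 1:
        problems.append("pos-from must be a positive integer")
    if pos_from > pos_to:
        problems.append("pos-from must not exceed pos-to")
    if pos_to > length:
        problems.append("pos-to must not exceed the sentence length")
    return problems


# Sentence (21), stored as entry 1012.
S1012 = "Der Vorschlag bringt den Studenten darauf, daß der Plan falsch ist."
```

For entry 1012, substring(S1012, 4, 5) yields the second noun phrase, and check_positions(S1012, 7, 12) reports that pos-to exceeds the sentence length of eleven words.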

3.4.2  Database  System 

The  database  is  administered  in  the  programming  language  awk  (cf.  [1]). 

Some  of  the  reasons  which  speak  in  favor  of  awk  are: 

•  awk is in the public domain, running under UNIX, and should run in other
environments; in particular, it runs on MS-DOS.

•  Its ability to handle strings of characters as conveniently as most languages
handle numbers makes it more suitable for our purposes than standard relational
database systems; i.e., it allows more powerful data validation, increasing
the availability of information with a minimal number of relations
and attributes.

Compared to standard databases awk has a restricted area of application and does
not provide fast access methods to information, but it is a good language for
developing a simple relational database in a number of cases. Additional resources
and tools such as a report generator and a routine for consistency checking can
be easily implemented.

The database includes a reduced SQL-like query language. We use the database
entries of the example given above to ask the following queries:

(i)  retrieve all sentences which include a correlate and a subordinate clause beginning
with daß.

query:   retrieve sentence-id, example
         from sentence
         where match(arg-description, "cor") and match(arg-description, "sc-dass")
result:  1012  der Vorschlag ... falsch ist.

(ii)  retrieve the position and the lexical form of all NP’s of sentence 1012.

query:   retrieve cat-description, position, substring
         from sentence, appears_in
         where sentence-id = 1012 and category-description = "NP"
result:  NP  1  2  der Vorschlag
         NP  4  5  den Studenten
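
The behaviour of the two queries can be imitated over in-memory relations. The following Python sketch is not the awk/lex/yacc implementation: the helper names retrieve and match are hypothetical, match is reduced to a plain substring test, the separator characters inside arg-description are a guess at the original notation, and query (ii) is answered from the appears-in relation alone because all requested attributes are stored there.

```python
SENTENCE = [
    {"sentence-id": 1012,
     "arg-description": "nom_acc_cor_sc-dass",   # separators are an assumption
     "example": "der Vorschlag bringt den Studenten darauf, "
                "daß der Plan falsch ist."},
]

APPEARS_IN = [
    {"sentence-id": 1012, "category-description": "cor",
     "pos-from": 6, "pos-to": 6, "substring": "darauf"},
    {"sentence-id": 1012, "category-description": "fin-matrix-verb",
     "pos-from": 3, "pos-to": 3, "substring": "bringt"},
    {"sentence-id": 1012, "category-description": "NP",
     "pos-from": 1, "pos-to": 2, "substring": "der Vorschlag"},
    {"sentence-id": 1012, "category-description": "NP",
     "pos-from": 4, "pos-to": 5, "substring": "den Studenten"},
    {"sentence-id": 1012, "category-description": "sc-comp",
     "pos-from": 7, "pos-to": 11, "substring": "daß der Plan falsch ist."},
]


def match(value, pattern):
    """Substring test standing in for match() of the query language."""
    return pattern in value


def retrieve(relation, columns, where):
    """Project `columns` from those rows of `relation` that satisfy `where`."""
    return [tuple(row[c] for c in columns) for row in relation if where(row)]


# Query (i): sentences with a correlate and a daß-clause.
query_i = retrieve(
    SENTENCE, ["sentence-id", "example"],
    lambda r: match(r["arg-description"], "cor")
    and match(r["arg-description"], "sc-dass"))

# Query (ii): position and lexical form of all NPs of sentence 1012.
query_ii = retrieve(
    APPEARS_IN, ["category-description", "pos-from", "pos-to", "substring"],
    lambda r: r["sentence-id"] == 1012 and r["category-description"] == "NP")
```

Representing each relation as a list of records keeps the sketch close to the flat files that awk processes line by line.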


The query language has been developed under SunOS using the utilities lex and
yacc. lex is a lexical analyzer generator designed for processing of character input
streams. yacc, an LALR(1) parser generator, is an acronym for Yet Another
Compiler Compiler. It provides a general tool for describing an input language to
a computer program.

3.5  Auxiliary  Materials 

The database of syntactic material is to be accompanied by a few auxiliary development
tools. First, in order to support further development of the catalogue
and database, it must be possible to obtain a list of words used (so that we minimize
vocabulary size), and a list of differentiating concepts (so that categorization
names may be accessed easily). Second, documentation must be available on each
of the areas of syntactic coverage included. This is to cover (minimally) the delimitation
of the area of coverage, the scheme of categorization, and the sources
used to compile the catalogue.

Third,  a  small  amount  of  auxiliary  code  may  be  supplied  to  support  development 
of  interfaces  to  parsers.  This  need  not  do  more  than  dispatch  sentences  to  the 
parser,  and  check  for  the  correctness  of  results. 
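
Such auxiliary code might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function run_suite, the record layout of the entries, and the interface of the parser (a callable returning a well-formedness judgement and a list of NP positions). The two entries reuse sentence (3) and the ill-formed string (5) with their example numbers as identifiers.

```python
def run_suite(parse, entries):
    """Send each entry's sentence to `parse` and compare with the annotations.

    `parse` is any callable returning a dict with the keys "wellformed" (bool)
    and "np_positions" (list of (from, to) pairs). Returns the ids of the
    entries on which the parser disagrees with the catalogue.
    """
    failures = []
    for entry in entries:
        result = parse(entry["example"])
        if result["wellformed"] != entry["wellformed"]:
            failures.append(entry["id"])
        elif entry["wellformed"] and (
                sorted(result["np_positions"]) != sorted(entry["np_positions"])):
            # NP positions are only compared for well-formed sentences.
            failures.append(entry["id"])
    return failures


ENTRIES = [
    {"id": 3, "example": "Der Student bittet den Manager um den Vertrag.",
     "wellformed": True, "np_positions": [(1, 2), (4, 5), (7, 8)]},
    {"id": 5, "example": "Er hat verzehrt, und zwar Bohnen.",
     "wellformed": False, "np_positions": []},
]
```

A parser that accepts every string and reports no noun phrases would fail on both entries, which is the kind of feedback the diagnostic facility is meant to provide.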


4  Comparison  to  Other  Work 

This  appears  to  be  the  first  attempt  to  construct  a  general  diagnostic  facility 
for  German  syntax,  even  if  virtually  every  natural  language  processing  group 
working  on  German  has  a  small  suite  of  sentences  used  for  internal  monitoring 
and  debugging. 

There have been several related efforts concerned with English syntax. Guida and
Mauri [8] report on attempts to evaluate system performance for natural language
processing systems (n.b., not merely syntax) in which they attempt to finesse the
issue of correctness (which we argue to be central) by measuring user satisfaction.
We have attempted to address the issue of syntactic correctness head-on.

Hewlett-Packard Laboratories compiled a test suite of approximately 1,500 sentences
which it distributed at the Public Forum on Evaluating Natural Language
Systems at the 1987 Meeting of the Association for Computational Linguistics [4].
That effort differed from the present one in that it tried to evaluate semantics
and pragmatics, as well as syntax, and in that it consisted essentially of sentences
without annotated properties. The sentences were not organized into a database.

Read et al. [11] advocate a “sourcebook” approach, in which fewer examples are
submitted  to  much  closer  scrutiny.  The  closer  scrutiny  doesn't  seem  subject  to 
automation,  at  least  at  present.  Furthermore,  their  emphasis  is  on  evaluating 
systems  for  natural  language  understanding,  and  the  primary  focus  seems  to  be 
on  domain  modeling,  conceptual  analysis  and  inferential  capabilities,  not  syntax. 
It  is  similar  to  the  HP  approach  (and  to  ours)  in  employing  primarily  constructed 
examples,  rather  than  naturally  occurring  ones. 

The Natural Language group at Bolt, Beranek, and Newman Systems and Technologies
Corporation circulated a corpus of approximately 3,200 sample database
queries formulated in English at the 1989 DARPA Workshop on Evaluating Natural
Language [10]. The emphasis here, too, was on system (natural language
understanding) performance, rather than specializations, but most of their examples
seem to come from actual trial use of a natural language interface program,
which gives their work added value.

The University of Pennsylvania’s “Treebank” project (similar to a project of the
same name at the University of Lancaster sponsored by IBM) has begun an effort
to annotate naturally occurring text and speech, and to organize the annotations
into a “Treebank”. The annotations are phonetic, syntactic, semantic and
pragmatic, and the intended scope is monumental. Since they wish to gather representative
and varied data, they hope to collect and annotate approximately 10*
words.

Finally, the Text Encoding Initiative of the Association for Computational Linguistics
is a loosely organized confederation of efforts concerned with the classification
and annotation of various sorts of texts. Our work will be made available
to this group.


5  Current  State,  Future  Plans 

5.1  Collaborations 

We have contacted some research groups in the area of NLP and machine translation,
which have shown interest in cooperating on the effort by submitting data
sets in exchange for the use of the database. Among these are the Institut für
angewandte Informationswissenschaft (IAI), Saarbrücken and a research and development
group at Siemens, Munich.


5.2  Eventual  Range  of  Syntax  Catalogue 

As mentioned, we regard our work only as a starting point which has to be complemented
by contributions from other groups and individual experts. As to
extensions of the database, we can only provide the roughest of lists here. We
intend the list to be suggestive rather than definitive:

Syntax of the simple clause, including verbal government and genera verbi (passive,
etc.), negation, word order, and adverbial modification, including temporal
adverbials (duratives, frequentatives, and “frame” adverbials), locative, manner,
and measure adverbials. Verb phrase complementation including argument sharing
or inheritance (auf Hans ist er stolz), clause union, extraposition, modal and
auxiliary verbs. Verbal complex, fixed verbal structures (Funktionsverbgefüge),
separable prefix verbs, idioms and special constructions.

Noun phrase syntax, including determiner and numeral (and measure) system,
relative clauses of various sorts (including preposed participial phrases), pre-
and postnominal adjectival modification, noun phrase coordination, and plurals.
Pronominal system and anaphora.

Prepositions and postpositions, cliticization, particles (e.g., als, ja, je, denn).

Questions, including long-distance (multi-clause) dependence. Imperative and
subjunctive moods. Adjectival and nominal government, modification, and specification.
Equative, comparative, and superlative constructions. Coordination and
ellipsis.


Categorizing the underlying capabilities of natural language processing (NLP) systems
is a required preliminary step in the development of test suites for semantic and pragmatic
phenomena and of evaluation metrics that rate systems over the results achieved
against the benchmark test suites.

To  get  beyond  black-box  evaluation  of  NLP  systems  and  make  some  progress  toward 
glass-box  evaluation,  especially  in  the  semantics  and  pragmatics  areas,  does  not 
depend,  it  seems  to  me,  on  an  overarching  agreement  on  exactly  how  to  characterize 
semantic  and  pragmatic  phenomena  or  on  being  able  to  map  from  one  representation  to 
another or into a presumed superset "neutral" representation. Rather, if we acknowledge
that a fundamental task for any NLP system is to resolve ambiguity, then we
can  direct  much  of  our  attention  there,  acknowledging,  of  course,  the  need  to  assess 
additional  inferencing  requirements  as  well  as  performance  characteristics. 

For instance, suppose we agree that "the girl saw the boy on the hill with a telescope"
is five ways ambiguous (has five distinct meanings). Then, there ought to be five distinct
semantic representations provided by any NLP system that can handle the
phenomenon. Note that, for semantic evaluation, it is not necessary to require that
such an NLP system also provide five distinct syntactic representations (presumably revealing
different prepositional phrase attachments). Whatever syntactic analyses the
system provides are simply irrelevant to our purpose. Additionally, in a context that
pragmatically forces one of the semantic interpretations, an NLP system that is pragmatically
adequate in this area ought to make the right choice. To evaluate a system’s
capability to account for this sort of ambiguity, we do not need to evaluate precisely
how the semantic representations are achieved or what their precise formulation happens
to be. Nor do we need to know exactly how a system selects the correct interpretation
in an appropriate context. It is enough to know ("check off") that all five
interpretations are available and that the correct choice is made.
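
The "check off" procedure can be pictured with a short sketch. It is only an illustration under stated assumptions: the function check_off and the reading labels "r1" to "r5" are hypothetical, and the system's representations are treated as opaque values that a human judge has already mapped to reading labels, so that nothing depends on their internal formulation.

```python
def check_off(readings, expected_count, chosen=None, correct=None):
    """Check off ambiguity coverage without inspecting the representations.

    readings       -- labels of the readings the system produced
    expected_count -- number of readings agreed on for the sentence
    chosen/correct -- the reading selected in context and the expected one;
                      leave both as None when no context is supplied
    """
    return {
        "all_readings_available": len(set(readings)) == expected_count,
        "correct_choice": None if correct is None else chosen == correct,
    }


# The telescope sentence is taken to have five readings.
report = check_off(["r1", "r2", "r3", "r4", "r5"], 5, chosen="r3", correct="r3")
```

Duplicate representations count once, so a system that returns the same reading twice is not credited with additional coverage.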

Test suites will, of course, necessarily exemplify a wide range of phenomena at both
the sentence and discourse levels. Those involving ambiguity resolution per se include
word ambiguity, structural ambiguity, anaphora and definite reference resolution, and
quantifier and negation ambiguity (scoping). Additional evaluation criteria, which
largely go beyond ambiguity resolution as normally defined, include, for example, how
well a system establishes discourse relations (causal and temporal, among others), how
well a system draws inferences (lexical and structural, both guaranteed and invited),
and how fast and how well a system arrives at its results (both time and space complexity
are relevant). I would maintain that developing evaluation metrics, and the test
suites themselves, depends on establishing a valid categorization of NLP systems that
is independent of the tests themselves.

The categorization of NLP systems according to their capabilities, both actual and projected,
will afford insight in several ways. First, it will instruct us primarily on what
sorts of test suites to construct and on what evaluation measures to use. Second, it
will tell us a great deal about the purposes of the various systems that are being
developed by the computational linguistics community. Some are much more limited
in their intended uses than others; some are, however, very ambitious in their aims.
Third, it will give us not only a means to evaluate systems comparatively, but to assess
with some confidence any given system’s inherent coverage and extensibility at any
stage in its development. In short, we need benchmarks that are valid not just
for fully developed systems but that also provide a reliable means of gauging the
progress being made by the NLP community, and by particular groups within it, along
the development path toward robust NLP systems.

The fundamental considerations by which a system can be categorized comprise at
least an understanding of the following differentiators.

1.  The intended meaning of the semantic representations. What is the cognitive
and/or real-world validity that the representations are intended to model? Is the intended
meaning implicit or explicit? For example, for the sentence John loves Mary is
the representation rather like ’loves(John, Mary)’, where the interpretation is external
to the representation, or is it more like ’loves(cognizer: John, range: Mary)’, where it is
supposed that the interpretation of the roles that John and Mary play in the sentence is
fixed ("cognizer" and "range", respectively)? The latter sort of semantic representation
seems to carry with it assumptions about cognitive and/or real-world validity. Moreover,
for representations that attempt to make the intended interpretations explicit, we
can entertain making up tests that assess the "self-awareness" of the system in the
sense that we might suppose that such a system could answer questions about "how it
knows what it knows".

2.  The closed- or open-world assumptions that are made and the values that the
logic is capable of supporting. (I construe the word logic here in its most general
sense so as to include whatever reasoning principles a system supports.) Now, I take
it, rightly or wrongly, that a system that has only a two-valued logic is a system that
makes a closed-world assumption. For instance, in a system that has a Prolog-style
logic, ’yes’ means "demonstrable from the information at hand" while ’no’ means "not
demonstrably ’yes’". A system with a three-valued (or higher) logic will support an
open-world assumption. Thus, given information that: All men are mortal, Socrates is
a man, and Jill is not a man, a two-valued system will respond to the query Is Jill
mortal? with ’no’; but it will also report ’no’ to the query Is Sam mortal?, about
which we have, by hypothesis, no information. A three-valued system ought to
respond with ’don’t know’, or the equivalent, to the last inquiry. Systems that support
fuzzy or probabilistic logics might or might not make open-world assumptions. If
they do, then it is fair to test them over such a suite of inquiries.
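
The contrast between the two kinds of system can be sketched in a few lines of Python. The sketch is an illustration only: the function names and the encoding of the facts are assumptions, and the rule "all men are mortal" is applied directly rather than through a general inference engine. Under this strict encoding the three-valued function also answers "don't know" for Jill, since the stated facts do not settle her mortality either; that is a property of the sketch, not a claim made in the text.

```python
# "Socrates is a man". "Jill is not a man" adds no member to this set,
# and nothing at all is known about Sam.
MEN = {"Socrates"}


def is_mortal_two_valued(name):
    """Closed world: 'yes' if demonstrable from the facts, otherwise 'no'."""
    return "yes" if name in MEN else "no"   # rule: all men are mortal


def is_mortal_three_valued(name):
    """Open world: 'yes' if demonstrable, otherwise "don't know".

    No fact or rule in this tiny knowledge base can demonstrate that
    someone is not mortal, so 'no' is never returned.
    """
    return "yes" if name in MEN else "don't know"
```

A suite of such inquiries, with the expected answer recorded for each, would distinguish systems of the first kind from systems of the second.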

3.  What kinds of valid conclusions can be drawn using the logic and reasoning
system. It seems to me that we do not very much care from an evaluation standpoint
whether a system uses deductive, inductive, or abductive inferencing strategies. What
we do care about is the sorts of conclusions that are supported. Among the possibilities
are: a) ’X means Y’, b) ’X doesn’t mean Y’, c) ’X includes Y’, and d) ’X is similar
to Y’. Plausible examples of each of these putative inferencing axioms are a) kill
’means’ cause to die, b) kill ’doesn’t mean’ injure, c) kill ’includes’ murder, d) convince
’is similar to’ persuade.

4.  What  finitary  assumptions  are  made.  The  stability  of  a  system  under  different 
test  situations  may  cause  us  to  exclude  certain  kinds  of  sentences  or  dialogues  if 
queries are not guaranteed to terminate in finite time or with available storage
(workspace).

5.  What  semantic  primitives  are  assumed.  Systems  may  or  may  not  invest  in  such 
notions  as  "locative,  temporal,  purposive,  agent,  goal,"  and  the  like.  For  those  that  do, 
we  might  expect  to  test  quite  straightforwardly  for  such  distinctions  as  the  differing 
"roles"  that  Bill  and  algebra  play  in  John  taught  Bill  and  John  taught  algebra,  as  well 
as,  of  course,  John  taught  Bill  algebra.  For  those  systems  that  do  not  have  "functional" 
primitives,  we  would  ask  about  the  existence  of  equivalent  notions  and/or  mechanisms. 

6.  Whether the semantic representations are intended to be partial or complete
descriptions of sentences (propositions, discourses, etc.) and whether the representations
support both full and elliptical constructions. The questions at issue here
involve the representation of sentences like John is reading versus John is sleeping.
For the first, it is "understood" that John is reading something, while for the second it
is equally well "understood" that there is no something that John is sleeping. The adequacy
of semantic representations is clearly important to assessing the capabilities of
an NLP system, although it is not obvious how this should be gotten at in a test suite
that one would like to think is fair-minded.

7.  What kind of well-formedness conditions apply to the semantic representations.
While double agents may be found in the spy business, we are reasonably
confident  that  they  do  not  occur  as  independent  complements  of  a  verb.  Moreover,  we 
are  also  reasonably  sure  that  the  limit  on  direct  complements  is  three  (syntactically,  the 
subject,  object,  and  indirect  object).  We  should,  therefore,  test  an  NLP  system  to  see 
whether  the  semantic  representations  exclude  impossible  semantic  structures. 
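
The two conditions just mentioned can be turned into a mechanical check. The following is a minimal sketch only: the frame format (a predicate plus a list of role/filler pairs) and the role names "goal" and "theme" are assumptions made for illustration, not the representation of any particular system.

```python
from collections import Counter

# Assumed limit taken from the text: subject, object, and indirect object.
MAX_DIRECT_COMPLEMENTS = 3

def well_formedness_violations(frame):
    """Return a list of violations found in one hypothetical semantic frame."""
    violations = []
    complements = frame["complements"]
    # Condition 1: no role may be filled by two independent complements.
    role_counts = Counter(role for role, _ in complements)
    for role, count in role_counts.items():
        if count > 1:
            violations.append(f"role '{role}' filled {count} times")
    # Condition 2: no more than three direct complements.
    if len(complements) > MAX_DIRECT_COMPLEMENTS:
        violations.append(
            f"{len(complements)} direct complements exceeds {MAX_DIRECT_COMPLEMENTS}"
        )
    return violations

# Invented test items: the first is well formed, the second has a double agent.
good = {"predicate": "teach",
        "complements": [("agent", "John"), ("goal", "Bill"), ("theme", "algebra")]}
bad = {"predicate": "teach",
       "complements": [("agent", "John"), ("agent", "Mary"), ("theme", "algebra")]}

print(well_formedness_violations(good))
print(well_formedness_violations(bad))
```

A test suite item would then pair an input sentence with the expectation that the system's output frame yields an empty violation list.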

8. Whether an arbitrary or non-arbitrary mapping is assumed between syntactic
and semantic representations. The issues here involve the granularity and
completeness of the representations. Do the representations account for every morpheme in the
input sentence? For every meaningful difference in word order? I.e., do the
representations purport to be complete? If no current system does, it may be futile to develop
test suites to evaluate such niceties as the difference in meaning of They stopped to
search for survivors and They stopped searching for survivors, let alone the subtle
differences between Reagan sent Iran a message and Reagan sent a message to Iran.
(This is not to say that it wouldn’t be quite appropriate to test a system’s ability to
parse such sentences.) From this perspective it is clear that the capabilities of current
state-of-the-art systems will both determine and limit the test suites.

9. The kind(s) of knowledge on which pragmatic interpretations are based. Is the
system knowledge in lexicons and hierarchies, or is general encyclopedic knowledge
available? The kinds of test suites one will develop depend very much on the scope
and the extent of the knowledge bases available for reasoning. Then, too, it is
pointless to evaluate system performance over metaphors and idioms if no system has any
general way of dealing with these phenomena.

10. The types of inferencing that are supported. Especially of interest is whether a
system supports the Gricean maxims that involve providing information that is close
(or relevant) to a query. For instance, a system’s knowledge bases (given initially or
acquired on the fly via new text or message input) may contain the information that
Max died. To the query Did John kill Max? can the system conclude that the
information it has is likely to be relevant to the query? If a system "knows" that John
convinced Bill of something or other, is it capable of noting the relevance of the query
Did John persuade Bill?

11. Whether reasoning about classes (i.e. higher-order rules) is available. The
issues here involve the abilities of a system to recognize and exploit equivalent and
related phenomena by virtue of having mechanisms to relate semantic representations.
For example, assume that all cleft sentences have unclefted versions. Then, a system
might reasonably be expected to recognize that It was John who closed the door is a
"good" fit to the query Did John close the door? and might, then, be expected to
respond with ’yes’ if the first sentence is known to be true.

12. Whether the full range of natural language quantification is handled. If no
NLP system can deal with the full range of natural language quantification, then it is
clearly pointless to include test suites that contain such sentences as few tigers are
tame and John frequently walks the dog, expecting one or more systems to reason over
such sentences given the information that twelve tigers are known to be tame or that
John walks his dog three times a week except when he’s out of town. Clearly, it
would be better to stick with All men are mortal and the like if general quantification
strategies are not included in the capabilities of an NLP system.

The above listing is not intended to be anything even remotely construable as a
definitive statement about the range of things to be tested and evaluated in the
semantics and pragmatics areas. Instead, it is intended to foster a discussion of the
preliminaries that must be dealt with in order to construct glass-box test suites and metrics
suitable for ranking a system’s semantics and pragmatics analysis capabilities at any
given stage of its development and to rank the performance of one system against
another. I do want to emphasize that it seems to me quite hopeless to entertain any
notion of developing a universal representation scheme into which the representations
of particular NLP systems might be translated. The aims, scope, and capabilities of
NLP systems are just too varied for that. What does seem possible, however, is to
develop test suites and evaluation metrics that follow more or less directly from such
considerations as those given above. By keeping the evaluation at the higher,
functional level, it seems to me that reliable and defensible semantic and pragmatic
evaluation metrics can be fashioned. Furthermore, test suites and metrics will not be tied to
specific domains or to specific application areas, but will measure general system
capabilities, as desired.


0. Introduction

The Sundial (Speech UNderstanding and DIALogue) project is presently one of the
largest speech and language technology projects in Europe.^ It involves researchers
from five countries in 170 person years of effort spread across five years. The aim of
the project is to develop prototype spoken dialogue systems for limited domains (~2000
words) for each of English, French, German and Italian (Peckham, 1990).

As  far  as  possible,  common  approaches  to  design  and  implementation  have  been 
adopted,  with  some  software  modules  being  used  in  all  of  the  prototype  systems.  For 
example,  an  identical  dialogue  manager  module  is  used  in  each  of  the  prototypes 
(although  naturally  each  prototype  uses  its  own  language-particular  declarative 
knowledge  base)  (Bilange  1991;  McGlashan  et  al.  1991).  Figure  1  shows  the  overall 
architecture  of  the  four  Sundial  systems. 

The  task  of  evaluating  a  system  of  this  complexity  is  very  difficult  indeed,  and  it  is 
a  task  about  which  very  little  is  currently  known  or  understood.  In  Sundial,  we  have 
tried  to  find  a  balance  between  principle  and  practicality.  We  believe  that  there  is  a  close 
connection  between  the  processes  of  system  design  and  system  evaluation.  In  Section 
1, I outline the Sundial approach to design. In Section 2, I go on to show how data
collected  in  the  design  phase  can  be  used  to  good  effect  in  objective  system  evaluation. 
Conclusions  are  drawn  in  Section  3,  where  I  offer  a  key  metric  for  dialogue  systems. 
The  main  concern  of  the  paper  is  with  the  evaluation  of  the  dialogue  management 
performance  of  the  system,  although  this  is  only  one  of  several  foci  for  evaluation  in 
the  Sundial  project. 


^The work reported here was supported in part by the Commission of the European Communities
as part of ESPRIT project P2218, Sundial. Partners in the project are Cap Gemini Innovation, CNET,
IRISA (France), Daimler-Benz, Siemens, University of Erlangen (Germany), CSELT, Saritel (Italy),
Logica Cambridge, University of Surrey (U.K.). In addition, Politecnico di Torino (Italy) is an
associate partner and Infovox (Sweden) is a subcontractor.


Figure  1:  architecture  of  the  Sundial  system 


1. Corpus-based design

Design  of  the  Sundial  system  has  been  driven  to  a  significant  extent  by  the  results  of 
corpus analysis. Two different tasks are being implemented: flight enquiries and
reservations (English and French); and train timetable enquiries (German and Italian).
Initial corpora of human-human dialogues were collected using existing telephone
information  services.  A  representative  sample  of  dialogues  was  used  to  create  scenarios 
for  Wizard  of  Oz  (WOZ)  simulation  experiments  in  which  subjects  believed  they  were 
talking  with  an  operational  system  when,  in  fact,  they  were  really  conversing  with  a 
human  experimenter  whose  voice  had  been  made  to  sound  synthetic  by  filtering 
through  a  vocoder  (Ponamale  et  al.  1990,  Fraser  and  Gilbert  1991a).  The  English 
Sundial  team  collected  a  corpus  of  100  human-human  dialogues  (the  H-H  corpus)  and 
another  corpus  of  100  simulated  human-computer  dialogues  (the  H-C  corpus).  Since 
the  dialogues  in  the  H-C  corpus  were  driven  by  scenarios  derived  from  the  H-H 
corpus,  the  two  corpora  were  readily  comparable. 

The WOZ simulations revealed a number of important differences between human-human
dialogues and human-computer dialogues. For example:

• fewer words (tokens) are uttered in the H-C corpus;

• fewer distinct word forms (types) are used in the H-C corpus;

• ellipsis is virtually non-existent in the H-C corpus, although it is fairly common in
the H-H corpus;

• there is a tendency to avoid the use of some syntactic constructions in the H-C
corpus (e.g. there are 3-4 times as many relative clauses in the H-H corpus as there
are in the H-C corpus);

• there are roughly the same number of instances of overlapping talk (when both
speakers are talking simultaneously) in the whole H-C corpus as there are in one
average dialogue in the H-H corpus (for further details see Fraser and Gilbert
1991b).
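
The first two comparisons in the list above rest on simple token and type counts. A minimal sketch of how such counts might be obtained follows; the tokenization rule and the example utterances are invented for illustration and are not drawn from the Sundial corpora.

```python
import re

def token_type_counts(utterances):
    """Return (number of word tokens, number of distinct word forms)."""
    # Assumed tokenization: lower-cased runs of letters and apostrophes.
    tokens = [w for u in utterances for w in re.findall(r"[a-z']+", u.lower())]
    return len(tokens), len(set(tokens))

# Invented toy data standing in for the H-H and H-C corpora.
hh_utterances = [
    "er could you tell me when the flight from Paris gets in please",
    "the one that leaves at ten",
]
hc_utterances = ["when does the flight from Paris arrive"]

print("H-H:", token_type_counts(hh_utterances))
print("H-C:", token_type_counts(hc_utterances))
```

On real corpora the counts would normally be compared per dialogue or per turn, since the raw totals depend on corpus size.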

On  the  whole,  the  findings  of  the  simulations  are  very  encouraging,  since  they  indicate 
that  when  speaking  to  a  computer,  speakers  adapt  their  linguistic  behaviour  in  ways 
which  simplify  the  task  of  utterance  interpretation.  Some  of  the  simulation  findings  are 
precisely  those  which  might  have  been  predicted,  but  others  are  much  less  likely  to 
have  been  anticipated.  An  important  principle  underlying  the  Sundial  project  is  that 
system  design  should  be  firmly  rooted  in  the  empirical  evidence  of  the  simulation 
corpora.  How  best  to  make  the  transition  from  data  to  design  is  an  open  research 
question,  and  one  which  deserves  much  more  attention  than  it  is  presently  receiving  in 
the NLP community (Fraser et al. forthcoming). Clearly, a speech understanding
system  which  modelled  the  exact  behaviour  found  in  the  corpus  and  nothing  else, 
would  not  be  useful. 

The  approach  favoured  in  the  Sundial  project  is  to  make  abstract  descriptions  of  the 
corpus,  and  it  is  these  descriptions  which  are  modeled.  Descriptions  are  made  at  a 
number  of  different  levels.  For  example,  at  the  level  of  word  recognition  the  corpus  is 
not  nearly  large  enough  to  train  the  necessary  speaker-independent  word  recognition 
subsystem.  However,  it  does  provide  an  abstract  description  of  the  words  which  have 
to  be  recognized.  These  words  can  then  be  collected  elsewhere  and  used  in  training.  At 
a higher level, the corpus is tagged for word classes, syntax and semantics. Moving up
a level, the largest phrases (usually sentences, since ellipsis is virtually non-existent)
are assigned a dialogue act label, indicating their function in the dialogue (e.g. as
statements, questions, answers, etc). A dialogue grammar can then be built. This
differs from a sentence grammar in having to take account of the turn structure of the
dialogue (i.e. an important fact about question-answer exchanges is that answers are
uttered by the speaker who did not utter the question). At another level, the unfolding
goal structure of each dialogue is described. Dependencies between descriptions at
different levels result in the production of equivalence classes. For example, a particular
dialogue act may be realized by a number of different types of sentence. In this way
generalizations over the corpus are produced.

The Sundial system has been designed to model the behaviour found at each level
of analysis and to respect dependencies between levels. Each of the modules shown in
Figure 1 has been implemented (except that the application database is currently being
simulated within the Dialogue Manager). The first fully-integrated version of the
Sundial system will be delivered in July 1991.

2.  Corpus-based  evaluation 

The simulated H-C corpus also plays a central role in evaluation of the Sundial system.
A first approach is to use the utterances of subjects in simulation experiments as a script
for black box evaluations of the system as a whole. Evaluation at this level is unlikely
to be very informative, since there may be many different 'correct' ways of responding
to  an  utterance  at  a  given  point  in  a  dialogue,  only  some  of  which  will  be  represented  in 
the  simulation  corpus.  However,  since  the  corpus  is  tagged  with  labels  representing 
relatively  'deep'  things  such  as  dialogue  acts  and  task  goals  (as  well  as  the  more 
'surface'  oriented  lexical  and  syntactic  tags),  the  dialogue  management  performance  of 
the  system  can  be  evaluated  separately  from  other  functions  of  the  system  such  as 
linguistic  processing  or  message  generation. 

In  general,  the  boundaries  between  software  modules  also  correspond  to 
boundaries  between  levels  of  corpus  description.  The  Acoustic-Phonetic  Decoding 
module passes the Linguistic Processing module a word lattice or graph which ideally
includes the words which were actually uttered. The Linguistic Processing module
selects the 'best' string (on the basis of recognition scores and grammatical
well-formedness) and annotates it with lexical, syntactic, and semantic markers. This serves
as input to the Dialogue Manager module which further annotates the string with
dialogue act and task goal labels. At this point, the annotation process ceases for the
input string. Instead, the Dialogue Manager introduces a new dialogue act/goal object
corresponding to the system's next utterance. This is passed to the Message
Generation  module  which  maps  the  deep  representation  onto  a  surface  representation 
realizing  semantic,  syntactic,  and  lexical  features. 

A  glass  box  evaluation  can  be  used  to  monitor  the  internal  interfaces  of  the 
Sundial  system.  Each  time  a  module  outputs  a  representation,  the  new  annotations 
added  by  that  module  can  be  checked  automatically  against  those  in  the  simulation 
corpus. Since no annotations of a given type are ever added to a string by more than
one module, each module has absolute responsibility for those levels at which it adds
annotations. The approach of monitoring inputs and outputs to system modules is
particularly applicable to the Linguistic Processing and Message Generation modules.
The Dialogue Manager is rather different since its output is not a modified version of its
input. Rather, its output is a response to its input. To evaluate the Dialogue Manager it is
necessary to look inside it to see whether the dialogue act and task goal labels match
those found in the corpus. It is also necessary to ensure that the dialogue act/goal label
pairings for user utterance and system response belong to
the set of legal pairings derived from the corpus.
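
The two checks described in this paragraph, agreement of each annotation level with the corpus and membership of utterance/response pairings in the legal set, can be sketched as follows. The label names, the dictionary layout, and the set of legal pairings are hypothetical; they only illustrate the shape of the comparison, not the Sundial annotation scheme.

```python
def level_accuracy(system_annotations, corpus_annotations, level):
    """Fraction of utterances whose system label at `level` matches the corpus."""
    pairs = list(zip(system_annotations, corpus_annotations))
    if not pairs:
        return 0.0
    hits = sum(1 for s, c in pairs if s.get(level) == c.get(level))
    return hits / len(pairs)

def illegal_pairings(exchanges, legal_pairs):
    """Return the (user act, system act) pairings not attested in the corpus."""
    return [pair for pair in exchanges if pair not in legal_pairs]

# Hypothetical corpus annotations and system output for two user utterances.
corpus = [{"dialogue_act": "question", "task_goal": "get_arrival_time"},
          {"dialogue_act": "confirm", "task_goal": "get_arrival_time"}]
system = [{"dialogue_act": "question", "task_goal": "get_arrival_time"},
          {"dialogue_act": "statement", "task_goal": "get_arrival_time"}]

# Hypothetical set of legal pairings derived from the corpus.
legal = {("question", "answer"), ("question", "clarification_request")}
observed = [("question", "answer"), ("question", "greeting")]

print(level_accuracy(system, corpus, "dialogue_act"))
print(level_accuracy(system, corpus, "task_goal"))
print(illegal_pairings(observed, legal))
```

Because each annotation level is owned by exactly one module, a low score at a given level points directly at the module responsible for it.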

3. Conclusion

Dialogue understanding systems are difficult to evaluate because it is hard to define
what constitutes a well-formed dialogue. (Spoken dialogue systems add an extra layer
of indeterminacy which further complicates them.) In this paper I have suggested that
dialogue understanding systems should be evaluated in respect of a corpus which
includes a range of dialogue phenomena. The corpus should be annotated at a number
of different levels and the capacity of the system to generate annotations which conform
to a generalization of those found in the corpus should be measured. This process
should be repeated with corpora other than the one used during the design phase.

How  is  the  performance  of  the  system  at  different  levels  to  be  harmonized  for  the 
purposes of overall system evaluation? I suggest that the highest levels, those
involving dialogue acts and task goals, are all-important, and provide the principal
index  of  the  value  and  effectiveness  of  the  system. 

A word recognition system is fairly simple to evaluate directly in terms of its ability to
recognize and interpret its input (i.e. to map it into a different form); so are a parser and a
message generator. The main task of a dialogue manager, however, is not so much to
recognize utterances as to respond intelligently to them. There should be a close
match between the input and the output of a word recognition system, but there should
be an unfolding progression from a dialogue manager's input to its output. By
developing an abstract multi-level description of a corpus of simulation dialogues it is
possible  to  define  (i)  a  set  of  correspondences  between  different  levels  of  analysis  for  a 
given  string;  and  (ii)  a  set  of  well-formed  progression  paths  through  dialogues, 
expressed  at  the  levels  of  dialogue  acts  and  task  goals,  by  which  a  dialogue  manager 
can  be  evaluated. 

Failure  at  the  lower  levels,  such  as  word  recognition,  parsing,  and  text  generation, 
is ultimately not serious so long as a well-formed progression path leads from dialogue
opening  and  goal  formation  all  the  way  through  to  goal  satisfaction  and  dialogue 
closing.  Along  the  way,  there  may  be  any  number  of  insertion  sequences  for  purposes 
of  confirmation,  clarification,  and  the  repair  of  dialogue  failures  (spoken  dialogues 
typically  include  many  more  of  these  than  written  dialogues),  but  the  dialogue  system 
can  be  said  to  have  succeeded  in  its  primary  task  if  the  end  of  a  progression  path  is 
eventually  reached.  Up  to  a  point,  the  number  of  insertion  sequences  is  another  metric 
which  can  be  used  in  the  evaluation  of  dialogue  systems:  the  smaller  the  number  of 
insertion  sequences,  the  better  the  dialogue  system.  However,  the  ability  to  repair 
dialogue  failures  and  the  ability  to  recognize  that  confirmation  or  clarification  may  be 
necessary  are  vital  skills  in  dialogue.  Circumstances  in  which  there  is  need  for 
confirmation,  clarification,  or  repair  crop  up  routinely  in  ordinary  dialogues. 
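
The two measures discussed here can be computed from a sequence of dialogue act labels. The sketch below is deliberately simplified and rests on assumptions: the label names are invented, a dialogue is taken to have reached the end of a progression path when its last act is a closing, and each confirmation, clarification, or repair act is counted as one insertion sequence, whereas a fuller treatment would check every transition against the progression paths derived from the corpus.

```python
# Hypothetical labels for acts that open an insertion sequence.
INSERTION_ACTS = {"confirmation", "clarification", "repair"}

def evaluate_dialogue(acts, final_act="closing"):
    """Return (reached_end, insertion_count) for one dialogue's act sequence."""
    reached_end = bool(acts) and acts[-1] == final_act
    insertion_count = sum(1 for act in acts if act in INSERTION_ACTS)
    return reached_end, insertion_count

def completion_rate(dialogues):
    """Fraction of dialogues that reach the end of a progression path."""
    if not dialogues:
        return 0.0
    return sum(evaluate_dialogue(d)[0] for d in dialogues) / len(dialogues)

# Invented examples: one dialogue that recovers and closes, one that does not.
recovered = ["opening", "goal_formation", "clarification",
             "goal_satisfaction", "closing"]
abandoned = ["opening", "goal_formation", "repair", "repair"]

print(evaluate_dialogue(recovered))
print(evaluate_dialogue(abandoned))
print(completion_rate([recovered, abandoned]))
```

Reporting the two numbers separately keeps the primary criterion (completion) from being obscured by the secondary one (number of insertions).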

In  summary,  the  primary  criterion  for  the  evaluation  of  a  dialogue  system  should  be 
its  ability  to  reach  the  end  of  progression  paths  consistently.  A  secondary  criterion  is 
that  there  should  be  no  more  insertion  sequences  than  are  genuinely  required. 
Unpacking  this  criterion  leads  to  more  specific  direct  evaluation  criteria  for  the  lower 
level processes, as outlined above. An effective dialogue understanding system is not
one which never makes any mistakes, but rather it is one which always manages to
recover from its mistakes.


1. INTRODUCTION

This  paper  examines  the  evaluation  of  natural  language  understanding  within  a  data 
extraction  system.  Our  central  claim  is  that  NLP  capabilities  and  "end-to-end"  extraction 
success  can  and  should  be  evaluated  separately.  (This  observation  parallels  insights  in 
Sundheim 1989 and Palmer and Finin 1990.) The contribution of this paper is to illustrate
"lessons learned": we have implemented such an evaluation system in our work at SRA, and
will present some of the attendant pitfalls and triumphs.

Successful  evaluation  requires,  first,  that  two  seemingly  peripheral  issues  be 
addressed:  test  samples  and  the  categorization  of  errors. 

•  Representative  test  samples 

The  test  can  serve  as  a  predictor  of  how  the  system  will  fare  only  insofar  as  the  test 
materials  are  representative  of  the  actual  corpus  the  system  must  handle. 

•  Categorization  of  Errors 

Results  depend  not  only  on  capabilities,  but  on  the  correctness  of  the  supporting 
data.  That  is,  does  a  particular  capability  succeed,  assuming  the  data  in  the  test  sample  are 
correct?  Capabilities  cannot  be  tested  without  using  data.  One  way  to  acknowledge  the 
contribution  of  data  to  the  assessment  of  capabilities  is  to  count  data  errors  separately. 
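
Counting data errors separately amounts to keeping two tallies and scoring the capability only on items whose supporting data were correct. The following minimal sketch assumes that each test item has already been hand-labelled with one of three hypothetical outcome names; the labels and the sample counts are invented for illustration.

```python
from collections import Counter

def summarize_results(outcomes):
    """Separate data errors from capability errors in a list of outcome labels."""
    counts = Counter(outcomes)
    correct = counts["correct"]
    capability_errors = counts["capability_error"]
    # Items that failed because of bad supporting data are reported, not scored.
    attempted = correct + capability_errors
    return {
        "data_errors": counts["data_error"],
        "capability_errors": capability_errors,
        "capability_accuracy": correct / attempted if attempted else 0.0,
    }

# Invented sample: ten test items.
sample = ["correct"] * 6 + ["capability_error"] * 2 + ["data_error"] * 2
print(summarize_results(sample))
```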


2. "BLACK BOX" VS. "GLASS BOX"

"Black box" evaluation focuses on input and output rather than on methods. It refers
to evaluation of a full system, in this case from text to database fill. Black box evaluation
has the advantage of providing quantitative results, but, because it offers no insight into
what caused the failures and why, developers need further information if they are to use
testing results to guide their work. "Glass box" evaluation looks inside the system to judge
specific techniques and algorithms, but its very complexity makes it difficult to generate or
interpret quantitative results.

In  general,  black  box  testing  is  appropriate  only  for  a  completed  system.  It  gives  only 
an end-to-end assessment, making it difficult to pinpoint areas of strength or weakness. It
may even make it difficult to tell whether the system is excellent with only one significant
flaw, or good but slightly flawed throughout. The culprit here is the cascade effect, as
Greenstein and Blejer 1990 note: The overall score can be no better than the worst score
of any single component, or module. If every one of five modules were 80% accurate, the final
results would be the same as if four modules were 100% accurate and one was 33% accurate
(0.8 multiplied by itself five times is roughly 0.33).
Though the difference in individual module performance may not be important to the end
user,  it  is  crucial  to  the  developers. 
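
The arithmetic behind the cascade effect can be checked directly. The sketch below assumes that module errors are independent, so that per-module accuracies simply multiply; that assumption is ours, made to reproduce the figures quoted above, and real modules need not behave this way.

```python
import math

def end_to_end_accuracy(module_accuracies):
    """Overall accuracy of a pipeline, assuming independent module errors."""
    return math.prod(module_accuracies)

# Five modules at 80% each versus four perfect modules and one at 33%.
uniform = end_to_end_accuracy([0.8] * 5)
one_flaw = end_to_end_accuracy([1.0, 1.0, 1.0, 1.0, 0.33])

print(f"uniformly good: {uniform:.3f}")
print(f"one weak module: {one_flaw:.3f}")
```

The two configurations are indistinguishable from the end-to-end score alone, which is exactly why module-level scores are needed.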

It  is  not  necessary  to  abandon  black  box  testing  entirely,  and  with  it  the  quantitative 
measures  that  make  it  attractive:  it  is  possible  to  do  black  box  testing  on  individual  modules. 
That  is,  quantitative  results  can  be  obtained  using  fixed  input,  and  the  core  of  the  black  box 
methodology,  "Look  at  the  input  and  output,  not  how  it’s  processed,"  can  be  retained. 
Performing  module-level  black  box  testing  helps  to  overcome  the  major  disadvantage  of  end- 
to-end  black  box  evaluation,  since  module-level  output  can  provide  useful  feedback  to 
developers. 

Module-level  black  box  evaluation  is  possible  only  when  the  system  is  modularly 
designed  and  the  interfaces  between  modules  are  clearly  defined.  This  idea  is  not  new: 
black  box  evaluation  of  a  single  component  is  proposed  in  Palmer  and  Finin  1990  as  a  "glass 
box  methodology"  of  choice  (glass  box  in  that  it  offers  a  look  inside  the  system,  but  black 
box  in  that  it  does  not  look  inside  each  module).  They  do  not  really  discuss  implementation, 
however.  SRA’s  experience  in  evaluation  highlights  some  of  the  successes  and  pitfalls  in 
module-level  black  box  testing.  The  modules,  evaluation  criteria  and  procedures  are 
described  below. 


3.  TESTING  AT  THE  MODULE  LEVEL 
3.1  Setting  up  the  Procedures 

SRA’s  natural  language  understanding  system  has  separate  modules  for  preprocessing 
(morphological  analysis,  lexicon  lookup,  multiword  phrase  handling);  syntactic  analysis 
(phrase  structure  grammar  and  bottom-up  parser);  semantic  interpretation  (compositional 
and  frame-based);  and  discourse  analysis  (reference  resolution  and  other  intersentential 
links). We have designed and implemented a testing protocol that evaluates the output of
each module separately, as well as standard end-to-end black box testing.

Performance  of  the  testing  procedures  necessitated  some  changes  to  our  original  test 
design. Those changes are the topic of the remainder of this section. The testing cycle for
both  types  of  test,  under  the  original  design,  was  to  follow  these  steps: 

1)  Define  the  correct  input 

2)  Define  the  correct  output 

3)  Process  correct  input  through  system  under  test  to  produce  the  actual  output 

4)  Compare  the  actual  output  with  the  correct  output  to  calculate  a  score 

5) Analyze and evaluate the scores

6) Generate design and implementation tasks to correct errors

7)  Execute  those  tasks,  modifying  the  system  under  test 

8)  Return  to  step  3,  above,  and  repeat 

This  cycle  is  represented  schematically  in  Figure  1,  below.  The  goal  is  to  produce 
a  series  of  scores  over  time  that  can  be  compared.  If  the  only  piece  of  the  system  that 
changes is the system under test, then the change in the scores should reflect the change in
the  capabilities  of  the  system. 
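
Steps 3 and 4 of the cycle, repeated over successive versions with the input and correct output held fixed, can be sketched as below. Everything in the example is a stand-in: the "module" is a toy function that upper-cases its input, and exact-match scoring is an assumption, whereas the real modules manipulate large data structures and need richer scoring.

```python
def score(actual_outputs, correct_outputs):
    """Step 4: fraction of items whose actual output equals the correct output."""
    matches = sum(1 for a, c in zip(actual_outputs, correct_outputs) if a == c)
    return matches / len(correct_outputs)

def run_round(system_under_test, correct_inputs, correct_outputs):
    """Step 3 followed by step 4 for one version of the system under test."""
    actual = [system_under_test(x) for x in correct_inputs]
    return score(actual, correct_outputs)

# Steps 1 and 2: fixed input and fixed correct output (toy data).
inputs = ["a", "b", "c", "d"]
expected = ["A", "B", "C", "D"]

def version_1(text):
    # Toy early version: handles only two of the four inputs.
    return text.upper() if text in ("a", "b") else text

def version_2(text):
    # Toy later version after the errors have been fixed (steps 6 and 7).
    return text.upper()

# Step 8: repeat, producing a series of comparable scores over time.
history = [run_round(v, inputs, expected) for v in (version_1, version_2)]
print(history)
```

The scores in `history` are comparable only because the input, the correct output, and the scoring function do not change between rounds.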


Certain statements follow from the condition above, that the only piece of the system
that changes is the system under test:

1)  The  correct  input  is  definable,  can  be  provided,  and  is  fixed  throughout 
development 

2)  The  same  is  true  for  the  correct  output 

3)  The  scoring  method  is  complete  and  fixed;  combined  with  1  and  2  above,  this 
means  that  a  perfect  score  can  be  calculated  at  the  beginning  of  testing 

The  reality  of  our  development  was  that  we  violated  all  three  of  the  above  assumptions,  in 
whole  or  in  part.  None  of  these  negates  the  validity  of  our  results,  but  they  did  serve  to 
complicate  our  testing  and  evaluation. 

The  testing  cycle  with  which  we  began  to  obtain  our  baseline  results  was: 

1a) Define input using English words, phrases, and sentences to test capabilities
that had been coded, or would be coded, into the system under test

1b) Use existing modules to process English input into properly structured correct
input for the system under test

2) Don’t explicitly create a file of correct output

3) Process correct input through system under test to produce actual output

4a) Analyze the actual output, marking correct output explicitly for future tests,
and implicitly define remaining correct output by describing problems with
incorrect output

4b) Score the actual output, based on analysis; enhance the scoring methodology
as necessary to account for all the observed phenomena

5)  Analyze  and  evaluate  the  scores 




Figure  1:  Test  Methodology  As  Originally  Presented 


6) Generate design and implementation tasks to correct errors; coordinate with
on-going tasks already specified as scheduled for execution during the project

7) Execute those tasks, modifying the system under test

8) Return to step 1a, above, and repeat

This is shown schematically in Figure 2.


The  changes  to  the  process  take  into  account  several  aspects  of  actually  developing 
the  prototype  system  (especially  the  natural  language  understanding  modules): 

• Defining the correct input and output for each module is extremely difficult.
The modules are manipulating large and complex data structures. The only
tools we have to generate the structures are the other modules of the system,
which are themselves still under development. Therefore it is not always
possible to get from the English input to the correct input for the module,
making it difficult to test certain parts of a module until the other modules on
which it depends are working properly. The ideal procedure also assumes
that the interfaces to the systems under test are fixed. The project described
here is a prototype, so this assumption is false by definition.

• It is particularly difficult to design the correct input and output for capabilities
that haven’t been designed yet. This is in fact an early part of the design
process. It is therefore reasonable to provide test cases only for code that is
implemented. This makes it difficult to compare scores from one test to the
next.

• The scoring method should be expected to evolve, and needs to be customized
for each module. Useful scoring data must contain more information than
"wrong" or "missing" or "right for the wrong reason." As we learned what
kinds of errors needed to be distinguished in each module in order to provide
useful feedback for development, we refined the scoring methods. This leads
to good results on the methodology, but inconsistencies in the scores
themselves: the results are often so voluminous that it is not worth the time
it would take to go back and apply an insight gained in mid-scoring to
previously scored output.

• The tasks generated to fix the errors must be merged into the workplan with
the tasks already identified in the project workplan. In a purely testing-driven
methodology, the only tasks to be done would be those identified by test
errors. One would then expect to complete the implementation of those tasks
before the next round of testing. Work on these tasks went on in parallel with
testing. As tasks were defined by testing analysis, we determined if any of the
workplan tasks superseded them. If not, we adjusted our workplan to
accommodate these "new" tasks. This created certain scheduling problems,
noted here only to suggest how testing affects project planning and progress.

Because  the  input  and  scoring  methods  vary  over  time,  comparing  scores  over  time 
is  not  as  meaningful  as  we  would  like.  An  approach  to  this  problem,  if  the  interfaces  to  the 
modules  remain  fixed,  is  to  run  the  "current"  input  on  both  the  current  and  previous  versions 
of  the  modules.  This  yields  more  comparable  results,  at  the  expense  of  an  exponential 
number  of  scoring  tasks. 


3.2  Testing is Time-Consuming

Selecting  the  input  correctly  depends  on  several  criteria.  Of  course,  it  should  be 
representative  of  the  domain.  The  set  should  be  small  enough  so  that  evaluation  does  not 
supplant  system  development.  We  intended  to  run  the  tests  every  two  months,  but  the 
process was so time-consuming that we would recommend tests every six months instead.

We developed tests and testing procedures and ran three rounds of formal testing.
Specifically, we devoted time to these tasks:

•  Establish test suites for all four modules (Preprocessing, Syntactic Analysis, Semantic
Interpretation, and Discourse Analysis)

•  Conduct formal tests

•  Analyze test results to determine the cause of the error

•  Catalogue, assess and route the errors to the appropriate person to be fixed

Hours  spent  on  testing  as  percentage  of  overall  development  and  testing  time:  13% 

Most  important,  as  noted  in  the  previous  section,  every  attempt  should  be  made  to 
minimize  dependence  on  other  modules.  The  effect  of  excessive  interdependence  is 
devastating  in  terms  of  time.  If  the  syntax  test  includes  items  that  do  not  correctly 
preprocess, then the input to the syntax test will have to contain hand-generated
preprocessing output, a time-consuming procedure. Worse, if the semantics test uses items
that do not parse correctly, many hours will be devoted to hand-generating output to mimic
correct syntactic results. Discourse depends on correct results from all three previous
modules, or hand-generated input. We originally intended to generate correct input by
hand, but the process proved overwhelming. Instead we recommend that extreme care be
taken in choosing the test suites so they test capabilities in a targeted way: embed discourse
tests in sentences that parse, and use a restricted vocabulary unless you want to use your
evaluation time to add data.

Figure 2: Test Methodology As Actually Executed


To  give  specific  examples,  if  handling  appositives  correctly  is  an  issue,  then  there  is 
no  point  in  testing  the  semantics  of  appositives  if  they  don’t  yet  parse  correctly.  Some 
examples,  especially  in  semantics  and  discourse,  could  not  be  used  because  they  required 
earlier  results  that  could  not  be  generated  at  that  point  in  development. 

On  the  output  side,  specifying  expected  results  is  slow  and  painful,  particularly  for 
capabilities  still  in  the  design  phase  -  it  may  be  a  research  issue  even  to  design  the  optimal 
output.  We  addressed  this  issue  by  assigning  a  zero  score,  but  that  created  inconsistencies 
later,  when  the  capability  was  added.  Scoring  is  discussed  in  the  next  section. 


4.  SCORING 

To  retain  the  advantages  of  black  box  evaluation,  it  is  important  to  produce 
quantitative  results.  To  provide  useful  feedback  during  development,  qualitative  results  are 
necessary.  This  section  describes  how  we  used  scoring  to  address  both  of  those  objectives. 

4.1  Categorization  of  Errors 

Each error was categorized as one of two types, Type I or Type II. Type I errors
are those where some or all of the correct output is missing; in other words, the system did
not get something it should have. Type II errors are those where an incorrect piece of
output is generated, i.e., the system got something that it shouldn't have. These two types
of errors have analogues in Information Retrieval: Type I errors are those in recall (the
system did not retrieve a document that it should have) and Type II errors are those in
precision (the system retrieved a document that it shouldn't have). This
categorization into types was used for both black box and glass box evaluation.
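The analogy with recall and precision can be made concrete with a small sketch. It assumes that the expected and produced outputs of one test item can each be represented as a set of comparable pieces; the function name `score_item` and the example slot names are invented for illustration and are not taken from the project.

```python
def score_item(expected, produced):
    """Score one test item.

    expected -- the pieces of output the answer key says should appear
    produced -- the pieces of output the system actually generated
    """
    expected, produced = set(expected), set(produced)
    correct = expected & produced
    type_i = expected - produced     # missing output: a recall error
    type_ii = produced - expected    # spurious output: a precision error
    recall = len(correct) / len(expected) if expected else 1.0
    precision = len(correct) / len(produced) if produced else 1.0
    return {
        "type_i": sorted(type_i),
        "type_ii": sorted(type_ii),
        "recall": recall,
        "precision": precision,
    }


# Invented example: three expected pieces, one missed and one spurious.
result = score_item({"agent", "theme", "location"}, {"agent", "theme", "time"})
print(result)
```

Keeping the two error lists separate, rather than collapsing them into one count, preserves the distinction between "did not get something it should have" and "got something it shouldn't have."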

Errors  in  the  glassbox  evaluation  were  also  categorized  in  one  of  ten  categories: 

1)  Algorithm  Design  Incomplete 

2)  Algorithm  Design  Incorrect 

3)  Algorithm  Implementation  Absent 

4)  Algorithm  Implementation  Incomplete 

5)  Algorithm  Implementation  Incorrect 

6)  Data  Design  Incomplete 

7)  Data  Design  Incorrect 

8)  Data  Implementation  Absent 

9)  Data  Implementation  Incomplete 

10)  Data  Implementation  Incorrect 

These categories helped to determine the nature of the errors in the system. Over the
three rounds of testing, this categorization showed a predictable trend: the first
round saw mostly Algorithm Design Incomplete errors; the second round saw fewer of these
errors but an increase in Data Implementation errors; this was followed by a decrease in
errors in general in the third testing round. This is indicative of the nature of system
development: at first algorithms are either missing or incorrect; when they are developed,
inadequacies in the existing data are revealed; this is followed by resolution of those
errors and thus a more accurate system.
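A per-round tally of the ten categories is one simple way to watch this kind of trend. The sketch below is only an illustration: the function `tally_by_round` and the error records are invented, and the counts are not the project's data.

```python
from collections import Counter

# The ten glass-box categories listed above, built from their three parts.
# Only the Implementation stage has an "Absent" state in that list.
STATES = {
    "Design": ("Incomplete", "Incorrect"),
    "Implementation": ("Absent", "Incomplete", "Incorrect"),
}
CATEGORIES = [
    f"{aspect} {stage} {state}"
    for aspect in ("Algorithm", "Data")
    for stage, states in STATES.items()
    for state in states
]


def tally_by_round(errors):
    """Group (round_number, category) records into one Counter per round."""
    rounds = {}
    for round_no, category in errors:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        rounds.setdefault(round_no, Counter())[category] += 1
    return rounds


# Invented records, for illustration only.
errors = [
    (1, "Algorithm Design Incomplete"),
    (1, "Algorithm Design Incomplete"),
    (1, "Data Implementation Absent"),
    (2, "Algorithm Design Incomplete"),
    (2, "Data Implementation Incomplete"),
    (2, "Data Implementation Incorrect"),
    (3, "Data Implementation Incomplete"),
]
tallies = tally_by_round(errors)
for round_no in sorted(tallies):
    print(round_no, tallies[round_no].most_common())
```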

4.2  Module-Level Scoring Methodology

Each  module  that  was  tested  glass-box  had  to  have  the  testing  approach  described 
above  tailored  to  the  needs  and  representations  of  that  module.  The  following  paragraphs 
describe  how  these  were  scored. 

Preprocessing scoring. Preprocessing was scored using an all-or-nothing approach.
As the input suite is large (over 650 separate inputs), it seemed that 1 point per correct
answer would yield a meaningful trend over time. For Preprocessing, Type I errors
represented any incorrect or missing output. Type II errors catalogued output that was right
for the wrong reasons (e.g., Chuck E. Cheese was recognized as a name because "cheese"
wasn't in the lexicon, thus it was an unknown word, thus it defaulted to a name; as "cheese"
is pretty common, we might expect it to be in a proper lexicon, and the name would not be
recognized). Being the first module in the natural language understanding pipeline,
Preprocessing has the easiest time getting correct input (though it was not 100% successful
the first round, as it relies on a previous module to tokenize the English, and there were
some problems with that module).

Syntactic Analysis scoring. Syntax errors are counted by assigning one point for each
constituent  correctly  tagged,  designated  as  Type  I,  and  one  point  for  every  constituent 
correctly  attached,  designated  as  Type  II.  That  is,  a  Type  I  error  is  counted  for  each 
constituent  that  is  mislabelled,  and  a  Type  II  error  is  counted  for  each  constituent  that  is 
attached  to  the  wrong  place  in  the  parse  tree. 
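As a rough sketch of this point scheme, suppose each parse is represented as a mapping from a constituent identifier to its label and the identifier of its parent. That representation, the function name `score_parse`, and the choice to count a constituent that is missing altogether as both a Type I and a Type II error are assumptions made for illustration; the text above does not say how the project handled these details.

```python
def score_parse(gold, parse):
    """Compare a system parse with the answer key.

    gold, parse -- dicts mapping a constituent id to (label, parent_id).
    Returns (points, possible, type_i, type_ii).
    """
    type_i = type_ii = 0
    for node, (label, parent) in gold.items():
        got = parse.get(node)
        if got is None or got[0] != label:
            type_i += 1      # constituent mislabelled (or absent)
        if got is None or got[1] != parent:
            type_ii += 1     # constituent attached in the wrong place (or absent)
    possible = 2 * len(gold)  # one tagging point and one attachment point each
    return possible - type_i - type_ii, possible, type_i, type_ii


# Invented example: one wrong label and one wrong attachment.
gold = {
    "s": ("S", None),
    "np1": ("NP", "s"),
    "vp": ("VP", "s"),
    "np2": ("NP", "vp"),
    "pp": ("PP", "vp"),
}
parse = dict(gold, np1=("ADVP", "s"), pp=("PP", "np2"))
points, possible, type_i, type_ii = score_parse(gold, parse)
print(points, possible, type_i, type_ii)
```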

Syntactic  Analysis  returns  all  possible  parses  that  make  sense  according  to  our 
syntactic and semantic constraints, ordered from most to least plausible. In general,
Semantic Interpretation works on the first (or top) parse in the list. Therefore it seems to
make sense to evaluate how often the parse ranked best by the module is actually the
correct parse. However, this makes little sense for scoring purposes, mostly because of the
term "correct parse." If the top parse has zero errors, then it is, by our scoring, the correct
parse. If the correct parse were in the list of returned parses, but not the top parse, then
we would have to potentially score all returned parses to discover the one with zero errors.
Given that Syntax may return a large number of parses, this is not practical with our current
technology. Another assumption underlying such a measure is that all (or at least, several)
of the parses returned are syntactically correct, but one is to be preferred. This implies that
the grammar is perfect, but the weighting scheme is not. As our grammar is not yet perfect,
we would like to first concentrate on errors in the grammar. Afterwards, we can isolate the
weighting scheme, and design it to return the correct parse as the first parse in the list
returned. Therefore, though we considered reporting scores for both first parses and best
parses, in the end we only reported scores for first parses.

Semantic Interpretation scoring. A Type I error is scored for slots not produced that
should have been produced, and correct values inserted in wrong slots. Type II errors are
incorrect slots, and wrong values in correct slots. "Decisions" that Semantic Interpretation
must make were scored, including:

•  mapping  to  thematic  roles 

•  sentential  modifier  mapping  (adjunct  handling) 

•  noun  phrase  modifier  handling 

•  modifiers within sentential clauses or nominalizations

•  correct  predicate  identification 

•  correct  modification  and  scoping  of  stacked  noun  phrases 

•  correct  updating  of  referenced  knowledgebase  objects 

Each  of  these  decisions  is  also  worth  a  point. 

Discourse Analysis scoring. Phenomena are more easily isolated within Discourse
Analysis than in other modules. Each test sentence or group of sentences (the Discourse
Analysis subsuite has groups of interrelated sentences to test intersentential phenomena)
generally tests one capability. This capability is either the successful resolution of a
discourse object or an attribute being filled in an object through discourse phenomena. We
score one point when the resolution or attribution succeeds, and zero if not. Type I errors
are assigned when a resolution or attribution is either missed or gotten wrong. Type II
errors are assigned when a resolution or attribution was performed when it shouldn't have
been (the subsuite has "negative" examples as well as "positive," i.e., to make sure that
solutions that are too general are also penalized).

The  complexity  of  the  module-level  error  analysis  requires  that  it  be  done  by 
developers  familiar  with  the  system.  Less  detailed  categorization  serves  black  box  needs 
(quantitative  results)  but  does  not  show  what  needs  to  be  fixed  to  remedy  the  errors.  Even 
more  important,  a  developer  must  decide  which  corrections  will  yield  the  most  "bang  for  the 
buck,"  setting  sensible  priorities  for  error  remediation. 

When  the  capability  to  be  tested  is  entirely  absent,  a  score  of  zero  is,  of  course, 
appropriate.  But  it  is  not  always  clear  what  the  denominator  should  be;  in  some  sense  the 
solution  must  be  designed  in  order  to  count  how  many  points  will  be  assigned  to  a  correct 
solution. We dealt with this problem as best we could, by revising the denominators of
previous test sessions once we had a clearer vision of the ideal solution.

5.  CONCLUSION


We have examined a module-level evaluation methodology for natural language
understanding, illustrating insights gained during implementation. It is possible to produce
useful quantitative and qualitative results using such a methodology, provided certain pitfalls
are avoided.

Maintaining and Enhancing a NL DBMS Interface

R. E. Cullingford, IBS, Inc., Milford, CT
M. Graves, Dept. of CS, University of Michigan, Ann Arbor, MI

Introduction

The problem of evaluation of natural language (NL) technology presupposes a solution to a
prior problem: that of creating the NL system in the first place. The most [...] problem
in this development process is circumscribing the system's linguistic and pragmatic
capabilities. This problem is exacerbated when the system being evaluated is concurrently
undergoing modification: the process of enhancement often causes breakages to pre-existing
capabilities. This presentation will describe a practical attempt to grapple with the
development problem in the context of a commercial English-language DBMS access tool
called EasyTalk. A partial solution is embodied in the Lexicon Maintenance Facility (LMF), a
software system designed to document the interrelationships among the words and phrases
"understood" by the tool and to control the tool's testing.

Corpus-Based Testing

The standard method of documenting and testing the capability of a NL system is with a corpus
of sentences which the system should process correctly. For commercial systems, these
typically consist of several thousand queries (for example, BBN's FCCBMP corpus [RADC86]).
Running the whole corpus can become unwieldy, and thus it may be useful to break the corpus
into smaller, more manageable, corpora.

Several factors may affect the manner in which the corpus is divided, such as the class of NLP
used. For example, testing of a syntax-first parser (e.g., [HEND78, KAFL83]) may lead to
splitting the corpus to emphasize syntactic constituents, while a more semantically oriented
parser (e.g., [BURT76, BIRN81]) might base its division on semantic categories.

The process of the division will be based on the desired granularity, organization, and coverage
of the corpora. Should there be several small corpora, or relatively few large ones? Should
they be arranged in a flat, unstructured set, or hierarchically arranged? Are the corpora to be
in some sense "complete", say over a sublanguage (e.g., [KITT87]), or relatively sparse (i.e., an ad
hoc collection)? Also, is the corpus to contain sentences which do not yet work, i.e., which
represent new problems to be studied or desired enhancements?

The Lexicon Maintenance Facility described here houses a large (~2500) test query collection.
The query set is organized around the so-called General English component of the EasyTalk
lexicon, the collection of words and phrases supplied with the system and expected to have
substantially the same meaning across database applications. (The Application-Specific part
of the lexicon is supplied by the customer at setup time.)

We have divided our corpus in two ways. The first division is bottom-up via the lexicon. Each
entry in the lexicon has a small corpus associated with it to test its functionality. This corpus
documents the sentences that were used in the design of the lexical entry and also contains a
sufficient number of sentences to adequately exercise it in testing.

These corpora are grouped together into two different hierarchies. The first hierarchy is
conceptual. The second is physical. This is useful for testing: if a file is changed, it is easy to
test the corpora for all the lexicon entries in that file. More seriously, when a General English
syntactic/semantic primitive changes, the hierarchy allows a quick computation of a query
test set representing linguistic constructions which are probably affected by the change.
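The two selections described above can be sketched as simple lookups over lexicon entries. Everything in this sketch is hypothetical: the `LexEntry` record, the file names, and the primitive names are invented for illustration and do not describe how the LMF itself stores the lexicon. Only the query "Show average customer balance" comes from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class LexEntry:
    name: str
    source_file: str                                # physical hierarchy: where the entry is defined
    primitives: set = field(default_factory=set)    # primitives the entry relies on
    corpus: list = field(default_factory=list)      # queries that exercise the entry


def queries_for_file(entries, changed_file):
    """Test set for every lexicon entry defined in a changed file."""
    return sorted({q for e in entries if e.source_file == changed_file for q in e.corpus})


def queries_for_primitive(entries, primitive):
    """Test set for every entry that depends on a changed primitive."""
    return sorted({q for e in entries if primitive in e.primitives for q in e.corpus})


# Invented lexicon fragment.
entries = [
    LexEntry("average", "statistics.lex", {"computation"}, ["Show average customer balance"]),
    LexEntry("total", "statistics.lex", {"computation"}, ["Show total customer balance"]),
    LexEntry("customer", "nouns.lex", {"entity"}, ["List customers"]),
]
print(queries_for_file(entries, "statistics.lex"))
print(queries_for_primitive(entries, "entity"))
```

Collecting the queries into a set before sorting means a query that belongs to several corpora is run only once.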

The second division of the corpus is top-down via linguistic/database Capabilities. In the
DBMS access setting, a Capability is a characteristic database interaction (mediated by a
query in a DBMS-interface language such as SQL) triggered by certain linguistic constructions.
These Capabilities are arranged hierarchically. For example, EasyTalk supports
Computations; Computations can have single or multiple operands; Addition is a
Binary-Operand Computation. Each Capability has a corpus which demonstrates the functionality
supported.


Conceptual Analysis

EasyTalk's NL processor is based on the model of language understanding called Conceptual
Analysis. Here, the attempt is to get directly to a representation of the meaning of a query
without an intervening syntactic analysis. The analysis is managed by organizing syntactic
and semantic expectations at the level of individual words and phrases. (See (CuDSS) for
the approach and its methodologies.) For example, the query "Show average
customer balance" is in the corpus for the lexeme "average." The effect of the processing
information stored with the lexeme is to mark the meaning representation as calling for a
certain kind of single-operand computation. At the same time, the query also belongs to the
Capability corpus for Statistics Queries. As in all database retrieval queries, the ultimate
meaning of the query depends upon the database structure of the application. Thus the
capability tested will also depend upon the structure of the database (or the user's view of it).

Testing  a  Conceptual  Analyser 

EasyTalk is a tool for ad hoc information retrieval; a NL query is formulated, is processed into
a DBMS query, and yields data formatted into a report. From the standpoint of testing, the
LMF supports both "black box" and "glass box" evaluation. From the "black box" view, a query
is correct if its report is correct. A "glass box" evaluation is supported by the storage of several
key intermediate data structures, for example, the so-called Initial Meaning Representation,
which is the output of the conceptual analyzer; the so-called Final Meaning Representation,
which may contain augmentations due to context analysis; the results of database structure
ambiguity resolution rules, and so forth.

Within the LMF, a query can be scored as either correct or incorrect. If incorrect, the reason
may be stated. Since a query may appear in several corpora, it is important to prevent a query
from having more than one status. Inter-sentential ellipses (follow-ons) pose a problem in this
regard, since a given follow-on can legitimately be associated with more than one query, and
thus, in principle, it can inherit the prior query's status.

System Functionality

The Lexicon Maintenance Facility is composed of three primary systems: Lexicon Control,
Testing, and Maintenance; and two supporting systems: the Menu Selection System, and the
Lisp-Reporter Interface.

The Lexicon Control System performs four tasks:

•  Inform the user of the contents of the lexicon upon demand.

•  Define the corpora to test the General English lexicon. Each corpus is a collection of
sentences that exercises the functionality of one lexical entry.

•  Maintain these corpora, keeping track of changes to them.

•  Assign unique, system-wide IDs to query sentences. Jointly these queries form the
Master Test Collection.

The  Testing  System  performs  two  types  of  testing: 

•  Partial  Testing  of  the  lexicon.  The  natural  language  developer  may  need  to  test  a  portion 
of the lexicon that was changed or recently added. The developer specifies: a set of
corpora, the name of a recently modified file, a set of natural language Capabilities
(which exercises a NL or DB feature), or a collection of ad hoc queries.

•  Complete Testing of the lexicon. All corpora are tested. This is a test of the Master Test
Collection.

To  support  the  Testing  System,  three  utilities  are  provided: 

•  Use the stored records and indications of changes to generate the collection of queries to
test.

•  Run a log against this collection.

•  Verify that the queries completed successfully; this is done by the Verification Utility. It
does this by checking that the current results of the query match the results stored as
being correct. The query results consist of the Meaning Representation and optionally
the reports.
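The check performed by the Verification Utility amounts to a regression comparison against stored results. The following is a minimal sketch of that idea, not the LMF's implementation: the function name `verify`, the dictionary representation, the query ids, and the meaning-representation strings are all invented for illustration.

```python
def verify(stored_correct, current):
    """Compare current query results with the results stored as correct.

    stored_correct, current -- dicts mapping a query id to its result
    (for instance a printed meaning representation).
    """
    report = {"match": [], "differ": [], "unspecified": []}
    for query_id in sorted(current):
        if query_id not in stored_correct:
            report["unspecified"].append(query_id)   # no stored result to compare with
        elif stored_correct[query_id] == current[query_id]:
            report["match"].append(query_id)
        else:
            report["differ"].append(query_id)        # needs a developer's attention
    return report


# Invented results.
stored_correct = {"q1": "(average (balance customer))", "q2": "(list customer)"}
current = {
    "q1": "(average (balance customer))",
    "q2": "(list account)",
    "q3": "(count customer)",
}
report = verify(stored_correct, current)
print(report)
```

A query in the "differ" list is not necessarily broken; the new result may be the better one, in which case the stored result would be updated rather than the system.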

The  Maintenance  System  tracks  changes  in  the  results  from  the  Testing  System.  It  allows  the 
natural  language  developer  to  update  the  query  results  that  the  Verification  Utility  should 
consider  to  be  correct. 

The  Menu  Selection  System  allows  the  LMF  user  to  make  quick  and  efficient  use  of  the 
facilities described above. The menu system consists of a (possibly tangled) hierarchy of choice
menus. These choice menus allow the user to choose either another menu or a command to
execute. If a command is chosen, the user will be prompted for its arguments. Additionally, the
system will only allow responses of the correct (user-defined) type.

The Lisp-Reporter Interface generates the inputs necessary for the Batch Reporter to generate
reports from a specification of either:

•  a set of corpora,
•  a set of natural language Capabilities, or
•  a collection of ad hoc queries

and, out of that collection of queries, whether or not reports should be generated for:

•  queries that generate mql that have been specified to be correct,
•  queries that generate mql that have been specified to be incorrect,
•  queries that generate mql whose correctness has not been specified,
•  all of those queries, or
•  queries which have a different report or MR than what is stored.

Conclusions

The LMF has been in use for almost two years. During that time EasyTalk has undergone
explosive enhancement. We would probably not have been able to control its development
without this facility, and thus the LMF embodies a first pass at the kind of technology that is
needed to first create, then evaluate, large NL systems.

1.  Introduction

While  evaluation  is  in  full  swing  for  speech 
and  natural  language  understanding  (NLU),  with 
concrete  metrics  and  competitions  across  sites, 
evaluation  for  natural  language  generation  (NLG) 
has  remained  at  the  discussion  stage.  While  some 
may  argue  this  is  because  the  field  is  smaller  and 
less  mature  (and  perhaps  more  stubborn),  as  we 
see it, the reason is that evaluating generation is
more difficult. Not only is it hard to define what
the input to a generator should be, but it is also
hard to objectively judge the output.

In  this  paper  we  first  set  out  some  goals  for 
NLG  evaluation  and  point  out  potential  pitfalls. 
We then look more closely at the particular
problems for evaluating generation systems.
Finally we explore some short term possibilities
for evaluation and look to the long term.

2.  Goals 

It can be instructive to draw an analogy
between evaluation and exams in school. As we
tell students, exams are both to show them their
strengths and to point out where they need
improvement. We also give exams so that we can
know who the better students are, who is paying
attention and working hard. While exams succeed
to some extent in meeting these goals, they also
have  their  negative  side:  they  encourage  poor 
study  habits,  such  as  cramming,  and  they 
discourage  creative  problem  solving,  since  the 
best  grades  most  often  go  to  those  who  answer  in 
the  way  the  teacher  expects. 

Evaluation  for  generation  will  also  have 
positive  and  negative  sides.  The  main  goal  should 
be to show us which techniques are succeeding
and which problems still need work. In addition
(and whether we intend it or not) evaluations will
make  comparisons  between  systems  and 
researchers  by  which  funding  agencies  can  rank 
projects;  this  real  competition  can  lead  to  the 
adoption  of  the  analogs  of  the  poor  study  habits  as 
researchers  direct  their  work  specifically  to  win 
the  competitions.  An  inescapable  effect  of 
evaluation  metrics  is  that  they  will  lead  the  field  to 
work  on  some  problems  over  others,  just  as 
exams  direct  students  to  study  some  aspects  of  a 
subject  over  others.  For  example,  a  generation 
task  requiring  an  exact  paraphrase  of  an  input  text 
will  discourage  work  on  lexical  choice,  since  a 
system  with  fewer  possibilities  is  less  likely  to 
make  an  error. 

To  avoid  the  negatives,  the  metrics  need  to  be 
designed  carefully,  so  that  they  do  not  encourage 
last  minute  hacking  and  so  that  the  judgements 
reward  real  progress  rather  than  showmanship. 

3.  Issues in Evaluation of Generation Systems

One  of  the  major  problems  with  evaluating 
generation  systems  is  the  complex  relationship 
between the input and the output. If we simply
considered the output and attempted to do a post-
hoc evaluation of generation systems, we would
have to conclude that the field has regressed over
the last 15 years. Consider the following samples
of generated text:

"The game started with my taking a corner, and
you took an adjacent one. I threatened you by
taking the middle of the edge opposite that and
adjacent to the one which I just taken, but you
blocked it and threatened me. I blocked your
diagonal and forked you. If you had blocked
mine, you would have forked me, but you took
the middle of the edge opposite of the corner
which I took first and the one which you had just
taken and so I won by completing my diagonal."
PROTEUS, Davey 1974

"You know for some reason I just thought about
the bill and payment again. (You shouldn't give
me a bill.) <Uh> I was thinking that I (shouldn't
be given a bill) of asking you whether it wouldn't
be all right for you not to give me a bill. That is, I
usually by (the end of the month know the
amount of the bill), well, I immediately thought of
the objections to this, but my idea was that I
would simply count up the number of hours and
give you a check at the end of the month."
ERMA, Clippinger 1974

"Assume that there is some barber who shaves
everyone who doesn't shave himself (and no one
else). Call him Giuseppe. Now, anyone who
doesn't shave himself would be shaved by
Giuseppe. This would include Giuseppe himself.
That is, he would shave himself, if and only if he
did not shave himself, which is a contradiction.
This means that the assumption leads to a
contradiction. Therefore it is false, there is no
such barber." MUMBLE, McDonald 1978.

"Kennedy  only  got  a  small  number  of  delegates  in 
the  election  on  20  February.  Carter  just  lost  by  a 
small  number  of  delegates.  He  has  several 
delegates  more  than  Kennedy  in  total." 
PAULINE, Hovy 1985.

"Little  dogs  chase  mice.”  "Floyd  bought  Helga  a 
book  to  read  on  the  plane.”  MUMBLE-86, 
Meteer,  McDonald,  et  al.  1987. 

"Knox,  which  is  C4,  is  en  route  to  Sasebo.  It  is 
at  79W  18E  heading  55W.  It  will  arrive  4124, 
and  will  load  for  four  days."  PENMAN,  Hovy 
1988. 

"Mon.  08-MAY  89  10:49AM  Abbie  is  at  Lotus 
Point,  which  is  at  125  Main  Street.  Her  skill  is 
managing.  Abbie  is  a  plant  manager.  Abbie  likes 
watching  movies.  She  watched  "The  Lady 
Vanishes"  on  SUN  07  May  7:20  pm." 
SPOKESMAN,  Meteer,  1990. 

There are many differences between the early
systems, exemplified by the first three examples,
and today's systems, exemplified by the last two,
that cannot be seen by inspecting just the output
text. First, the range of texts today's systems can
produce is drastically greater. Clippinger's
ERMA, for example, only produced that one
paragraph, whereas SPOKESMAN produces text
for eight different applications and many different
texts for each, covering most of what the
applications are capable of representing.

3.1  Progress  in  NLG 

The  above  examples  make  it  clear  that 
measuring  progress  in  NLG  is  not  a  simple 
matter. 

On  one  dimension,  as  the  application  becomes 
more  complex,  the  more  possible  inputs  the 
generator  has  to  deal  with  and  the  more  complex 
structurally  those  inputs  will  be.  Generators  that 
ignore  some  of  that  complexity  may  produce  more 
fluent  text  by  "canning"  some  sets  of  decisions; 
however,  in  the  long  term  generators  will  have  to 
handle  the  complexity  in  order  to  accurately  reflect 
the situation/state the underlying program is in.

Related  to  this  is  how  closely  tied  the 
generator  is  with  its  underlying  program.  In 
systems such as Davey's Proteus (Davey 1976)
and Kukich's Ana (Kukich 1988) the application
was custom designed to suit the needs of the
generator. This close fit makes very fluent text
possible, but at the cost of markedly more work in
developing the applications, and the result cannot
be used for any other applications. In general, as
generators have become more portable, and thus
able to be used with a variety of applications, their
text has become less fluent. In Figure 1, we
label this dimension "complexity of the situation"
for the number of different situations/states the
generator can handle both within one application
and  across  different  applications. 


Figure  One 

On another dimension, as the number of
decisions the generator ("consciously") makes
increases, the less fluent the text becomes, at least
in the early stages. Template driven generators
are doing less work, since many decisions are
made simultaneously by executing the template.
The extreme of this is a print statement, which is a
single "decision" and can produce text as "fluent"
as the programmer desires. Progress along this
dimension means components that are increasingly
less stipulative. Early generation systems, such as
Proteus and Mumble, didn't address areas such as
content selection or text planning. As these and
other problems are taken on more systematically
in modern generators, the overall competence of
the system initially falls off and then gradually
increases as the new component becomes more
sophisticated.

For example, in Mumble-86 (Meteer et al.
1987) and Nigel (Mann & Matthiessen 1985),
where the focus was on linguistic competence,
typical examples were single sentences (in
contrast to the full paragraphs of earlier
generators). As systems such as Spokesman and
Penman grow today (using Mumble-86 and
Nigel, respectively, as their linguistic
components), the focus is on text planning, and
they are producing paragraphs again. While they
are not as fluent as the early systems, the
paragraphs are more directly motivated by their
underlying programs. However, until these text
planners become more sophisticated, they will not
fully exercise the competence of their linguistic
components.

As Figure One shows, progress in the field
can be seen as a progression from the simplest
situation and a single decision point (such as a
print statement in a compiler) toward composing
text for complex underlying programs, programs
that can represent not only a great deal of
information but also how salient that information
is in a given situation.
However, as we have seen, the progress will not
necessarily be reflected in the output; a print
statement can reproduce Shakespeare.

3.2  "Glass  Box"  Evaluation 

It is clear that in order to measure progress,
we need to look at more than the output of
generation systems. In a "glass box" evaluation,
the goal is to look inside a system and evaluate the
individual components. At first glance, it would
seem easy to compare generators in this way.
Nearly all divide the processing into two
components, linguistic realization and text
planning, and most use the same kinds of
knowledge resources: grammar, lexicon, plan
library. However, on closer inspection, the
similarity of these terms is deceiving. For
example, in some cases lexical choice is part of
text planning, in others it is in linguistic
realization. This difference also affects
what information is in the lexicon: is it simply the
words and their inflectional endings? Are
different derivational forms in the same entry or
different entries?

The conclusion of the AAAI-90 Workshop on
Evaluation of Generation Systems was that the
field was not yet ready for glass box evaluation.
We need to first be able to describe/define the
generation process, and to do that we need to
determine the space of decisions in the process
overall. Currently, there is little agreement in the
field on what the decisions are, how alternatives
should be represented, what control structure
determines the order in which the decisions
should be made, or the effect of a decision on
subsequent decisions. Different researchers focus
on different parts of the generation process (e.g.
text planning vs. linguistic realization) and take
into account different kinds of knowledge (e.g.
discourse structure vs. user models vs.
taxonomic domain knowledge).

Addressing these issues within the generation
community is the topic of the IJCAI-91 Workshop
on Decision Making in the Generation Process.
We  hope  that  this  and  subsequent  workshops  on 
the  topic  can  provide  a  firm  base  for  doing  glass 
box  evaluation  in  the  future. 


3.3  "Black  Box"  evaluation 

Given  the  complexities  of  evaluating  progress 
in  the  field  and  individual  components  of 
generation  systems,  an  alternative  is  treating 
systems  like  a  "black  box"  and  only  looking  at 
input/output behavior. This is the approach
taken  in  speech,  and  to  some  extent  in 
speech/language  systems.  In  this  section,  we 
discuss  the  viability  of  such  an  approach  for 
generation. 

3.3.1  The  input  to  generation 

The most obvious problem in a black box
approach is determining what an appropriate input
for generation is. Researchers in generation
define the boundaries of the process differently
and can have quite different requirements on what
information must be represented even at the points
where they overlap. There has been considerable
effort on determining a workable common input
for purposes of evaluation: the topic was one of
the sessions in the AAAI-90 Workshop on
Evaluating Generation Systems and it was a panel
session at the 1991 European Workshop on
Generation last spring. But these forays are only
the beginning of a long process, and concrete
results should not be counted on for evaluations in
the near term. Determining the necessary
properties of the input that are required to produce
fluent text is a key part of the generation problem,
and hypothesizing that the input should take a
certain form is part of how progress is made. In
this light, it is only natural that different
generation projects should presume inputs of
different character, and it would be wrong to
penalize a project by making the choice arbitrarily.

One possible way around the problems
created by stipulating the input for the evaluation
is to provide each project with the source code to a
complete application program and have them
extract whatever input they happen to need from
that program and give it the representations that
they use. Unfortunately this approach has its own
deficits, since it is not just the representation of the
input that projects differ on but the amount of
information it supplies and even the ontological
assumptions behind how that information is
conceptualized. If the application program is
taken off the shelf or written by some "neutral"
third party then much of what will be tested may
not be generation per se but the projects' ability to
bridge the ontology and representational
formalisms of the off-the-shelf system to their
own requirements, augmenting the information
that the program supplies with more information
that they need. This is a practical problem in the
real world, and one that we could consider
evaluating, but it would not be an evaluation of
the generation systems per se.

3.3.2  The  output  of  generation 

A second problem in an I/O evaluation is
judging the output of a generator: it is very
difficult to be objective in evaluating texts. As
teachers of college composition can attest, once
the students no longer make grammatical and
punctuation errors, the line between a "B" and a
"C" becomes very subjective. You cannot easily
point to what is wrong with a composition and
there is no simple set of instructions for how to
correct it, as there is for a calculus exam.

This suggests that the right approach for an
objective evaluation would be to put the generator in
the context of some larger system, where the
generator is answering a question (see Section 3.4
on task based evaluation). However, now only
one aspect of the text is being judged, its content.
In order to judge other aspects (fluency,
effectiveness, style), we are going to have to
accept non-objective evaluations.

This is not to say that subjective evaluations
cannot be fair. Many kinds of subjective rulings
are handled by having a panel of judges with a set
of criteria that each contribute a score to the final
ruling. This is accepted practice in such wide
ranging domains as essay contests, figure
skating, and beauty contests. Since most people
are "experts" at judging language, using panels of
judges to rate the output of generators is not out of
the question. Note, however, that this kind of
evaluation is costly: there is no automatic scoring
program that can read these texts and make
judgements.
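
As a rough illustration of the panel approach, the following Python
sketch averages the judges' ratings for each criterion and then averages
across criteria. The criterion names follow the aspects mentioned above
(fluency, effectiveness, style); the 1-5 scale, the equal weighting, the
sample ratings, and the name panel_score are all assumptions made for
this example rather than an established scoring scheme.

```python
from statistics import mean

# Hypothetical criteria, taken from the aspects named in the text.
CRITERIA = ("fluency", "effectiveness", "style")


def panel_score(ratings):
    """Combine a panel's ratings into per-criterion and overall scores.

    ratings: one dict per judge, mapping each criterion to a score
    (assumed here to be on a 1-5 scale).
    """
    # Average each criterion over all judges.
    per_criterion = {c: mean(r[c] for r in ratings) for c in CRITERIA}
    # Equal weighting of criteria is an assumption; a real panel might weight them.
    overall = mean(per_criterion.values())
    return per_criterion, overall


# Invented ratings from three judges for a single generated text.
judges = [
    {"fluency": 4, "effectiveness": 3, "style": 5},
    {"fluency": 2, "effectiveness": 3, "style": 4},
    {"fluency": 3, "effectiveness": 3, "style": 3},
]
per_criterion, overall = panel_score(judges)
```

Even with such a scheme the cost noted above remains: every rating has to
come from a human judge reading the text.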


3.4  Task  based  evaluation 

Given the difficulties in determining the input
for a generator and evaluating the output, we
might consider embedding the generator into a
task that is evaluatable, and then evaluating the
performance in that task. This is the approach
taken for language understanding, both for speech
language systems, where the task is database
access, and message processing, where the task is
template fill. This approach avoids the problems
of evaluating language per se; however, it
introduces the problem of designing the task so
that it will be the better generators that will
actually perform better at the task, and not, for
example, the ones with the best interfaces.

Two questions arise in defining such a task
for generators. First is how to objectively judge
performance without having to judge the language
per se, which, as we pointed out earlier, is
subjective. There are many ways to effectively
communicate the same message, so enumerating
the possible answers is not in general possible.
An alternative is to define a task in which the
generator produces instructions or directions
which a person then has to follow. The system
could then be judged on how well the person
follows the instructions/directions. Of course,
we then run into the problem of credit assignment:
what is being evaluated, the instructions or the
person following them?

4.  Conclusion

So far, we have raised more questions than
we have provided answers. A simple black box
evaluation for generation systems requires
defining an input and comparing the output, both
of which we have shown are very difficult for
generation. A task based analysis, which avoids
these problems, must address the issue of credit
assignment. Does this mean there is no hope for
objective evaluation of generation systems in the
near term? Perhaps. As in the examination
process, some subjects don't adapt well to exams
with "right" and "wrong" answers. Does it mean
that we cannot compare our systems? No. The
latest methodology in teaching composition is to
get the students to work together to informally
critique each other's work. In generation, this
translates to putting more effort into hands-on
"working workshops" such as the AAAI-90 and
the upcoming IJCAI-91 generation workshops,
and less effort into designing formal metrics.
What does it mean for the long term? As the
students go on to higher level courses, the need
for such objective measures decreases. Advanced
courses have essay exams or require only papers.
As our generators and the applications they work
with mature, we must put effort into designing
evaluation techniques that let us quantitatively
compare the effectiveness of our systems.

1  Introduction

Natural language generation capabilities are an important component of many classes of
intelligent systems. Expert systems rely on generation facilities to produce explanations
of their knowledge and behavior (e.g., [McKeown, 1988, Moore and Swartout, 1989, Paris,
1990]), intelligent tutoring systems employ generation components to instruct students,
provide hints, and correct misconceptions (e.g., [Cawsey, 1989]), and help systems employ
generation techniques to advise users about how to achieve their goals (e.g., [Wilensky et
al., 1984, Wolz, 1990]). In these contexts, systems must produce complex multi-sentential
texts, e.g., justifications of results, definitions of terms, descriptions of domain objects,
instructions about how to perform domain tasks, and comparisons of alternative methods
for solving problems or achieving goals.

In the field of expert systems, it was recognized that the explanation capabilities of
a system are tightly coupled to the knowledge base and reasoning component of that
system. Evaluations of early expert system explanation facilities showed that many
types of questions users would like to ask could not be answered satisfactorily because
the knowledge needed to justify the system’s actions, explain general problem solving
strategies, or define the terminology used by the system was simply not represented
and therefore could not be included in explanations [Clancey, 1983, Swartout, 1983].
However, simply improving the knowledge bases did not alleviate all of the explanation
limitations; knowledge about language was also required, see [Moore, 1989]. This led to
the realization that the range of user questions an explanation facility is able to handle
and the sophistication and quality of the responses it can produce depend on (at least)
two knowledge sources: (1) knowledge about the domain and how to solve problems in
that domain as represented in the intelligent system’s knowledge base, and (2) knowledge
about how to construct an adequate response to a user’s query in some communication
medium or combination of media.

As intelligent systems become more sophisticated and include capabilities for tailoring
presentations to the knowledge and beliefs of the individual user, the current
problem-solving situation, and the previous discourse, the quality of the natural language
utterances produced relies on an increasing number of additional knowledge sources.
Participating in such interactions requires methods for interpreting users’ utterances and
actions, methods for recognizing users’ goals and plans, strategies for choosing relevant
information to include in response to different types of questions in different situations,
and knowledge about how to organize and express the desired content in a coherent
(multimedia) presentation tailored to a particular user and dialogue situation. Natural
language generators for such systems must deal with a wide range of issues including:
discourse management, content planning and organization, planning referring expressions,
and choosing grammatical structures and lexical items.

In general, the quality of the responses produced is dependent not only on the
“linguistic capabilities” of the natural language generator, but on the intelligent system’s
domain  model,  user  modeling  component,  dialogue  manager,  plan  recognizer,  etc.  This 
leads  to  a  fundamental  problem  for  those  interested  in  the  problem  of  evaluating  natural 
language  generation  facilities.  Two  possible  approaches  for  evaluating  the  generation 
facilities  of  such  systems  seem  appropriate:  (1)  Devise  a  set  of  evaluation  criteria  that 
can  be  applied  to  a  generation  component  in  isolation  and  systematically  and  objectively 
measured.  This  requires  an  identification  of  the  aspects  of  the  problem  of  presenting  a 
text  to  the  user  that  are  to  be  considered  the  responsibility  of  the  generation  facility 
and  the  types  of  inputs  and  contextual  factors  it  must  be  able  to  take  into  account.  (2) 
Design  experiments  which  allow  evaluation  of  the  natural  language  generation  facility  in 
the  context  of  the  larger  intelligent  system  of  which  it  is  a  part. 

In  this  abstract,  I  will  first  discuss  some  of  the  problems  I  see  with  the  first  approach. 
The  second  approach  involves  evaluating  user  satisfaction  and/or  performance  as  a  result 
of  the  natural  language  generation  facility.  Almost  any  set  of  evaluation  criteria  proposed 
will  include  criteria  such  as  “understandability”.  Because  such  criteria  can  only  be  judged 
by human users, I believe that all evaluations must ultimately involve experiments with
human  subjects.  The  abstract  concludes  with  a  study  I  have  designed  to  evaluate  the 
natural  language  component  I  am  building  for  an  intelligent  tutoring  system. 


2  Methods  for  Evaluation 

2.1  Devise  Evaluation  Criteria  for  NL  Generation  Facilities 

I see three main problems with attempting to devise a set of evaluation criteria for natural
language generation facilities.

1. Where to draw the line. One of the first issues that must be settled is to
determine exactly what aspects of the problem of presenting an utterance are considered
under the purview of the generation facility. One distinction often quoted in the text
generation literature divides the generation problem into a strategic component which is
responsible  for  selecting  the  content  to  include  in  the  text  from  a  knowledge  base  and 
ordering  that  information,  and  a  tactical  component  which  is  responsible  for  producing 
grammatically  correct  sentences  expressing  the  chosen  content.  Decisions  such  as  where 
to  put  sentence  boundaries  and  choosing  lexical  items  are  typically  made  by  the  strategic 
component  or  by  an  interface  between  the  two  components.  While  many  would  agree  that 
the  capabilities  of  the  tactical  component  are  rightly  the  responsibility  of  a  generation 
facility,  there  is  much  less  agreement  regarding  the  strategic  decisions.  Should  tasks  such 
as  choosing  a  discourse  strategy  (e.g.,  analogy  vs.  definition),  content  planning,  planning 
referring  expressions,  and  lexical  selection  be  considered  the  jurisdiction  of  the  generation 
facility?  If  so,  it  becomes  more  difficult  to  separate  out  the  generation  facility  from  other 
parts  of  the  system  because  these  issues  are  affected  by  various  other  components,  e.g., 
the  knowledge  base,  user  model,  and  plan  recognizer.  If  these  issues  are  not  considered 
part  of  the  generation  facility,  then  many  of  the  most  important  factors  affecting  the 
quality  of  the  utterances  produced  will  not  be  evaluated. 

This  problem  may  be  equivalent  to  the  problem  of  identifying  what  the  inputs  to  a 
generation  facility  should  be.  Should  the  input  consist  of  intentional  goals  to  be  achieved, 
a  set  of  topics  to  be  expressed,  or  both?  How  should  knowledge  about  the  user  and  current 
context  be  provided  to  the  generator?  Until  we  make  some  headway  on  these  issues,  it 
will  be  difficult  to  devise  a  satisfactory  set  of  evaluation  criteria. 

2.  Evaluation  criteria  must  be  task  dependent.  One  of  the  problems  with 
attempting  to  decide  what  to  include  as  part  of  the  generation  facility  is  that  this  decision 
is  task  dependent.  For  example,  it  is  common  to  expect  that  a  generation  facility  used 
for  expert  system  explanation  should  select  the  content  to  be  included  in  an  explanation. 
However,  the  generation  component  of  a  machine  translation  system  need  not  select 
content  since  this  comes  from  the  parse  of  the  source  text. 

Recently, Swartout has made a similar observation. He has argued that the development
of evaluation criteria for a generation system depends heavily on how that system
will be used and that the development of task-independent criteria for evaluating
generation systems is very difficult, if not impossible, as not all criteria are relevant to all tasks.
For example, Swartout argues that one important criterion that a generation facility for
expert system explanation must meet is that of fidelity, i.e., the explanations produced
by the generator must be an accurate representation of what the expert system actually
does. Clearly such a criterion would not be appropriate for evaluating the generation
component of a machine translation system.

Swartout has called for development of several sets of criteria customized to the major
uses of generation systems and has put forth a set of desiderata for explanation
facilities [Swartout, 1990]. These criteria place constraints on the explanations themselves,
the mechanism by which explanations are produced, the adequacy of the expert system’s
knowledge base, and the effects of an explanation facility on the construction and
execution of the expert system of which it is a component. For a more extensive discussion of
the criteria and their implications, see [Swartout, 1990].

3. Evaluating subjective factors. The criteria proposed by Swartout are very
comprehensive and are quite useful as qualitative guidelines. However, it would be
desirable to form evaluation metrics from these criteria with objective methods for assigning
ratings to an explanation system. In some cases, the task of devising a method for
assigning a value to the metric seems straightforward. For example, fidelity can be assessed
by comparing traces of the system’s problem-solving behavior with the natural language
explanations it produces. Another of the criteria calls for low construction overhead,
and techniques from software engineering could be helpful in estimating the overhead in
system construction due to the explanation facility and how much savings in the
maintenance and evolution cycles are due to design decisions attributable to the requirements
imposed by explanation. Even some aspects of the understandability criterion could
be objectively measured. For example, one way to evaluate the factor of composability
(smoothness between topic transitions in a single explanation) would be to analyze the
system’s explanations to determine whether they adhere to constraints governing how
focus of attention shifts, as defined by Sidner (1979) and extended by McKeown (1982).

However, in other cases, it is difficult to envisage how objective measures for
assessment could be devised. For example, how can we assign a value to an explanation’s
naturalness (linguistic competence) and coherence? Furthermore, what is
understandable to one user may be obscure to others. The ratings of such factors are inevitably
subjective and can only be judged by human users. The most promising way to assess
the understandability of a system’s explanations will involve techniques that attempt to
measure users’ satisfaction with the explanations or the impact of the explanations on
users’ performance.

2.2  Evaluating  Generation  Facilities  in  Context 

The  purpose  of  a  generation  component  in  an  intelligent  system  is  to  facilitate  the  user’s 
access to the information and knowledge stored in that system. Thus one way to assess
the generation component is to assess the impact of natural language utterances on users’
behavior  and/or  satisfaction  with  the  system.  This  can  be  done  using  direct  methods 
such  as  interviewing  users  to  determine  what  aspects  of  the  system  they  find  useful  and 
where  they  find  inadequacies,  or  by  indirect  methods  which  measure  users’  performance 
after  using  the  system  or  monitor  usage  of  various  facilities. 

Assessing user satisfaction. One of the best sources of assessment information is
the user population. Soliciting user input regarding the appropriateness and helpfulness
of the natural language utterances produced by a system provides valuable feedback. In
the case of the MYCIN explanation facility, user reports of inadequate responses led to
identification of limitations in expert systems and inspired research efforts to address the
limitations.

Monitoring usage of natural language generation facility. Another telling
assessment of any automated tool is whether or not users actually avail themselves
of the tool and whether or not that usage is successful, i.e., they are able to get the
information they seek or are able to make the system perform the task they desire. Such
an evaluation has been done in the area of help systems. An empirical study of usage of
the Symbolics DOCUMENT EXAMINER, an on-line documentation system that supports
keyword searches, indicated that a substantial number of interactions with the system
ended in failure, especially when users were inexperienced [Young, 1987].

Assessing impact on users’ performance on task. Another way to assess the
contribution of a generation component is to determine how the natural language
capabilities of the system contribute to users’ learning or effectiveness in using the system to
achieve their goals. For example, if an explanation component is included as part of an
intelligent tutoring system, then it should be possible to design a simple experiment that
assesses the explanation component’s contribution to learning. Moreover, by varying
the explanation strategies employed by the system, we can design studies that compare
alternative explanation strategies.


3  A  Proposed  Evaluation  Study 

I  am  currently  building  an  explanation  component  for  SHERLOCK  II,  an  intelligent 
coached  practice  environment  developed  to  train  avionics  technicians  to  isolate  faults 
in  a  complex  electronic  device.  Using  SHERLOCK  II,  trainees  acquire  and  practice  skills 
by  solving  a  series  of  problems  of  increasing  complexity  using  a  simulation  of  the  actual 
job environment in which these skills will be required. In SHERLOCK II, natural language
texts are generated in response to students’ requests for help during problem solving and
also  in  the  Reflective  Follow-Up  Phase  (RFU)  which  allows  students  to  review  their  own 
problem-solving  behavior  and  compare  it  to  that  of  an  expert.  During  problem-solving, 
hints  review  the  student’s  steps,  explain  the  normal  function  of  a  component,  tell  the 
student  what  component  to  test  next,  or  present  a  method  for  testing  a  component. 
During  RFU,  students  can  ask  the  system  to  justify  its  conclusions  about  the  status  of 
components,  suggest  what  should  have  been  tested  in  what  order,  or  compare  alternative 
strategies  for  diagnosing  faults. 

Currently, SHERLOCK's natural language interaction with the user is accomplished
with simple, non-adaptive strategies using canned text. The system does not have a
conceptual model of what it has said to the user, and previous hints affect later hints
in only the simplest possible way, i.e., if the system has given the first hint attached to
a particular problem-solving goal and the trainee asks for more, the system gives the
next hint in the list for that goal. While SHERLOCK has been successful in field testing
[Nichols et al., in press], SHERLOCK project members feel that further improvements will
come from enhancements to the explanation facility and a more sophisticated student
modeling component.
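
The non-adaptive hint behavior described above can be sketched in a few
lines of Python. This is only an illustration of the "next hint in the
list" policy, not SHERLOCK's actual implementation; the class name
CannedHints, the goal name, and the hint texts are hypothetical.

```python
class CannedHints:
    """Serve canned hints in a fixed order, one list per problem-solving goal."""

    def __init__(self, hints_by_goal):
        # hints_by_goal maps a goal name to an ordered list of hint strings.
        self.hints_by_goal = hints_by_goal
        # The only "memory" of the dialogue: how many hints were given per goal.
        self.next_index = {goal: 0 for goal in hints_by_goal}

    def next_hint(self, goal):
        """Return the next unused hint for the goal, or None when the list is exhausted."""
        hints = self.hints_by_goal[goal]
        i = self.next_index[goal]
        if i >= len(hints):
            return None
        self.next_index[goal] = i + 1
        return hints[i]


# Hypothetical goal and hint texts, invented for the example.
hints = CannedHints({
    "test-relay": [
        "Consider which component to test next.",
        "The relay is a likely candidate.",
    ],
})
first = hints.next_hint("test-relay")
second = hints.next_hint("test-relay")
third = hints.next_hint("test-relay")
```

The per-goal counter is the only state kept, which makes concrete the
point that such a system has no model of what it has already said.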

The explanation generator I am building plans presentations, using text and graphics,
from a set of strategies that are being derived from analyses of human-human tutoring
interactions in this domain. By adapting my previous work [Moore, 1989], the presentation
planner will be able to plan explanations tailored to the individual user, problem-solving
situation, and dialogue context.^ The explanation generator will be capable of answering
follow-up questions in context and elaborating on previous explanations.

^The planner makes use of a student modeling component described in [Kats, 1981].

I will test two hypotheses: (1) a practice environment that is capable of providing
hints  and  explanations  is  better  than  one  that  simply  simulates  the  environment  allowing 
the  student  to  explore  with  no  explanatory  feedback,  and  (2)  an  adaptive  explanation 
facility  (capable  of  tailoring  explanations  to  users  and  situations,  providing  elaborations 
and  answers  to  follow-up  questions)  is  better  than  a  non-adaptive  facility  (using  canned 
hints  and  responses  to  questions)  in  terms  of  students’  satisfaction  with  the  system  as  well 
as  their  learning  of  the  troubleshooting  task,  retention  of  skills,  and  transfer  of  knowledge 
to  related  tasks. 

To  test  our  hypotheses,  we  will  run  a  study  in  which  three  groups  of  subjects  are 
compared:  Group  1  will  use  a  system  which  provides  no  hints  or  explanations,  Group 
2  will  use  the  existing,  non-adaptive  explanation  facility,  and  Group  3  will  use  the 
adaptive  explanation  facility  currently  being  built. 

Students' satisfaction will be assessed using direct observation and interview
techniques. Subjects from each group will use the system with an observer present. They
will be instructed to make any comments they have about the system to the observer.
The observer will also interview each subject to solicit subject opinion about the
appropriateness and helpfulness of explanations. Comparing the data gathered for each group
will allow us to determine which type of system is favored.

To assess learning, subjects in each group will work through a sequence of problems.
After solving each problem, they will engage in an RFU session. Each student's
performance on solving each troubleshooting problem will be measured. SHERLOCK II contains
tools for automatically monitoring and assessing both higher-level (e.g., ability to choose
next component to test) and lower-level skills (e.g., ability to use the oscilloscope). We
will make use of these facilities to measure students' overall performance as well as
performance gains during the problem sequence. If our hypotheses are correct, we will expect
subjects in Group 3 to show the greatest performance gains and overall score.


Abstract

The paper describes a simple method for objectively evaluating the
compositionality of a transfer-based Machine Translation system. The
question is the extent to which rule interaction gives rise to (unwanted)
side-effects. An example is given of the use of the method in the context
of the BCI (Bilingual Conversation Interpreter), an interactive
transfer-based bidirectional Machine Translation system.


Introduction 

When trying to evaluate a Machine Translation system, two different
approaches are possible: either the system's behaviour in its proposed
environment is assessed, or the theoretical coverage and worth of the
transfer formalism is evaluated. The first type of evaluation concentrates
on translation quality and effectiveness, while the latter seeks to specify
which linguistic constructions the system can handle. Most work in the
field has been concerned with system behaviour; here, we will concentrate
on linguistic coverage.

In the literature on Machine Translation, a number of criteria are
mentioned as significant when evaluating the worth of a transfer formalism;
among these are expressiveness, simplicity, generality, reversibility,
language-independence, monotonicity and compositionality. Unfortunately,
when trying to convince others of the worth of one's own approach, it soon
becomes evident that most of these are not easy to measure objectively,
if they are not absolute properties of the formalism. (In particular, a
pure unification-based formalism is guaranteed to be monotonic.) To say,
for example, that a formalism is "good" from the point of view of
expressiveness, and then back this up with five carefully-chosen examples,
is not really to say very much.

¹ Part of the research described in this paper was also reported on at
the Meeting of the International Working Group on Evaluation of Machine
Translation Systems, Les Rasses, Switzerland, April 1991.

² The work reported here was funded by the Swedish Institute of Computer
Science, and the greater part of it was carried out while the fourth
author was employed there.

Compositionality, however, can be measured objectively. Here, we will
describe a simple method for evaluating the compositionality of a
transfer-based MT system, and give an example of its use in the context of
the BCI (Bilingual Conversation Interpreter) (Alshawi et al. 1991), an
interactive transfer-based bidirectional system currently being developed
in a co-operation between SICS² and SRI Cambridge. The main components of
the BCI are English (Alshawi ed. 1991) and Swedish (Gambäck, Lövgren &
Rayner 1991) versions of the SRI Core Language Engine, transfer taking
place at the level of Quasi Logical Form (QLF) (Alshawi & van Eijck 1989);
the transfer formalism is unification-based and bidirectional. Our
approach to Machine Translation is aimed at keeping the transfer component
as simple as possible, while depending on fully constrained reversible
monolingual grammars for correct analysis and synthesis.


Measuring compositionality


Perhaps the most important factor in keeping transfer simple is the degree
to which the transfer relation is a homomorphism, i.e. the degree to which
transfer rules are compositional.

For compositionality to be a meaningful notion in the first place, it must
be possible for transfer rules to apply to partial structures. These
structures can consequently occur in different contexts; other transfer
rules will apply to the contexts as such. The question is the extent to
which particular combinations of rules and contexts give rise to special
problems. In a perfectly compositional system, this will never happen,
although it seems a safe bet that no such system exists today. What we
want is a method which objectively measures how closely we approach the
compositional ideal.

Our first step in this direction has been the construction of
compositionality tables, in which a set of rules and a set of contexts are
systematically combined in all possible meaningful combinations. This is
done in order to figure out the extent to which the complex transfer rules
continue to function in the different contexts.

In the following three diagrams, we give an example of such a table for
the current version of the BCI. Table 1 gives a set of rules, which
exemplify six common types of complex transfer. Table 2 gives a set of
twelve common types of context in which the constructions referred to by
the rules can occur. Finally, Table 3 summarizes the results of testing
the various possible combinations.

To test transfer compositionality properly, it is not sufficient simply to
note which rule/context combinations are handled correctly; after all, it
is always possible to create a completely ad hoc solution by simply adding
one transfer rule for each combination. The problem must rather be posed
in the following terms: if there is a single rule for each complex
transfer type, and a number of rules for each context, how many extra
rules must be added to cover special combinations? It is this issue we
will address.


Table 1: Types of complex transfer used

Type                          Example
Different particles           John likes Mary
                              John tycker om Mary
Passive to active             Insurance is included
                              Försäkring ingår
Verb to adjective             John owes Mary $20
                              John är skyldig Mary $20
Support verb to normal verb   John had an accident
                              John råkade ut för en olycka
Single verb to phrase         John wants a car
                              John vill ha en bil
                              (lit.: "wants to have")
Idiomatic use of PP           John is in a hurry
                              John har bråttom
                              (lit.: "has hurry")

Table 2: Transfer contexts used

Context                       Example
WH-question                   Who does John like?
                              Vem tycker John om?
Relative clause               The woman that John likes
                              Kvinnan som John tycker om
Sentential complement         [example illegible]
Embedded question             I know who John likes
                              Jag vet vem John tycker om
VP modifier                   John likes Mary today
                              John tycker om Mary idag
Object raising                I want John to like Mary
                              Jag vill att John ska tycka om Mary
                              (lit.: "I want that J. shall like M.")
Change of aspect              John stopped liking Mary
                              John slutade tycka om Mary
                              (lit.: "J. stopped like-INF M.")

[The remaining context rows of this table are not legible.]

Table 3: Compositionality table
(Swedish-English shown above English-Swedish)

Transfer context        Different   Active to   Verb to     Support verb    Single verb   Idiomatic
                        particles   passive     adjective   to normal verb  to phrase     use of PP
Present tense           OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
Perfect tense           OK          generator   OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
Negation                pres-not    pres-not    pres-not    past-not        pres-not      transfer
                        pres-not    pres-not    pres-not    past-not        pres-not      transfer
Y-N question            OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
WH-question             OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
Passive                 four entries OK, eight entries "." (column positions not legible)
Relative clause         OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
Sentential complement   OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
Embedded question       OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
VP modifier             OK          transfer    OK          OK              OK            OK
                        OK          transfer    OK          OK              OK            OK
Change of aspect        OK          OK          OK          OK              OK            OK
                        OK          OK          OK          OK              OK            OK
Object raising          transfer    transfer    transfer    transfer        transfer      transfer
                        OK          OK          OK          OK              OK            OK

Each square in Table 3 consists of two entries, the first for the
Swedish-English, and the second for the English-Swedish direction. The
entries are to be interpreted as follows:

• . means that the combination was not applicable, i.e. that the
  construction referred to by the rule cannot occur in this context.

• OK means that analysis, transfer and generation all functioned
  correctly, without any extra rule being necessary to deal with the
  particular context.

• generator means that the generator component was unable to generate
  the correct target language sentence.

• transfer means that the transfer component was unable to make a
  correct transfer.

• All other entries are names of rules needed to deal with special
  combinations of rule and context. For this table, only two extra rules
  were needed: pres-not, which reverses the relative scope of the
  operators for negation and the present tense, and past-not, which
  performs a similar function for the past tense.

The actual results of the tests were as follows. There were 136 meaningful
combinations (some constructions could not be passivized); in 115 of
these, transfer was perfectly compositional, and no extra rule was needed.

Of the remaining 21 rule/context/direction triples, seven failed for
basically uninteresting reasons: the combination "Perfect tense +
Passive-to-active" did not generate in English, and the six sentences with
the object-raising rule all failed in the Swedish-English direction, since
that rule is currently uni-directional. The final fourteen failures are
significant from our point of view, and it is interesting to note that all
of them resulted from mismatches in the scope of tense and negation
operators.
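
The counts quoted above can be checked mechanically from a compositionality table. The following is a minimal sketch, not the authors' tooling: the dictionary layout and all names are assumptions made for illustration, and the placement of the "OK" and "." entries within the passive row is assumed, since only their number (four and eight) is known.

```python
from collections import Counter

# Each context maps to (Swedish-English row, English-Swedish row); each row
# has one entry per complex transfer type, in the column order of Table 3.
OK6 = ["OK"] * 6
NEG = ["pres-not", "pres-not", "pres-not", "past-not", "pres-not", "transfer"]
VPM = ["OK", "transfer", "OK", "OK", "OK", "OK"]
PAS = ["OK", "OK", ".", ".", ".", "."]  # column positions are an assumption

TABLE = {
    "present tense": (OK6, OK6),
    "perfect tense": (["OK", "generator", "OK", "OK", "OK", "OK"], OK6),
    "negation": (NEG, NEG),
    "Y-N question": (OK6, OK6),
    "WH-question": (OK6, OK6),
    "passive": (PAS, PAS),
    "relative clause": (OK6, OK6),
    "sentential complement": (OK6, OK6),
    "embedded question": (OK6, OK6),
    "VP modifier": (VPM, VPM),
    "change of aspect": (OK6, OK6),
    "object raising": (["transfer"] * 6, OK6),
}


def tally(table):
    """Count every entry of the table, over both transfer directions."""
    counts = Counter()
    for sw_en, en_sw in table.values():
        counts.update(sw_en)
        counts.update(en_sw)
    return counts


counts = tally(TABLE)
total = sum(counts.values())        # all rule/context/direction triples
applicable = total - counts["."]    # the meaningful combinations
compositional = counts["OK"]        # no extra rule needed
problematic = applicable - compositional

print(total, applicable, compositional, problematic)
```

Keeping the table as data makes it cheap to recompute the tally whenever a rule or a context is added.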

The question now becomes that of ascertaining the generality of the extra
rules that need to be added to solve these fourteen unwanted interactions.
To reorder the scopes of tense, negation and modifiers, and account for
the scope differences between the English and Swedish QLFs arising from
the general divergences in word-order and negation of main verbs relevant
here, two rules involving general transformations of the QLF structure
were added. These solved ten of the outstanding cases.

The four bad interactions left all involved the English verb to be; these
were the combinations "Passive to active + VP modifier" and "Idiomatic use
of PP + negation", which failed to transfer in either direction. Here,
there is no general solution involving the addition of a small number of
extra rules, since the problem is caused by an occurrence of to be on the
English side that is not matched by an occurrence of the corresponding
Swedish word on the other. The solution must rather be to add an extra
rule for each complex transfer rule in the relevant class to cover the bad
interaction.

Summarising the picture, to solve the specific examples in the test set,
two extra rules were thus required. The tests revealed that all bad
interactions between the transfer rules and contexts shown here could be
removed by adding four extra rules to cover the 124 possible interactions.


Extending  the  framework 

It should be pointed out that the compositionality table presented here is
still too small to detect more than a fraction of the bad rule
interactions that may occur in the current system. Most important is to
extend systematically the set of contexts, taking note of the fact that
many of the features they are intended to represent are in fact orthogonal
to each other.

A full set of contexts would include at a minimum all legal combinations
of independent choices along the following dimensions:

• Tense: Present, past or future.

• Mood: Active or passive.

• Negation: Positive or negative.

• Modification: Unmodified, PP modification, ADVP modification, modified
  by fronted constituent.

• Clause-type: Declarative sentence, Y-N question, WH-question, relative
  clause, sentential complement, embedded question, progressive VP
  complement, object raising.

Multiplying out all the choices gives a total of 384 distinct contexts;
this must then be multiplied by the number of transfer rule types to be
tested, and doubled to get both directions of transfer. With the figures
given above, 4608 sentences would have to be tested; in practice, of
course, not all combinations are possible. Specifically, passives don't
interact well with other rule-contexts, leading to a total size of the
test set of 3082 sentences.
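
The multiplication can be reproduced by enumerating the cross product of the five dimensions. This is a sketch under stated assumptions: the dictionary and variable names are illustrative, and the six rule types are taken from Table 1. It computes only the upper bound; the filtering that brings the set down to 3082 sentences depends on linguistic constraints not spelled out here, so it is not modelled.

```python
from itertools import product

# The five independent dimensions listed above.
DIMENSIONS = {
    "tense": ["present", "past", "future"],
    "mood": ["active", "passive"],
    "negation": ["positive", "negative"],
    "modification": ["unmodified", "PP", "ADVP", "fronted constituent"],
    "clause_type": [
        "declarative sentence", "Y-N question", "WH-question",
        "relative clause", "sentential complement", "embedded question",
        "progressive VP complement", "object raising",
    ],
}

N_RULE_TYPES = 6   # complex transfer types, as in Table 1
N_DIRECTIONS = 2   # Swedish-English and English-Swedish

# Every context is one choice per dimension.
contexts = list(product(*DIMENSIONS.values()))
upper_bound = len(contexts) * N_RULE_TYPES * N_DIRECTIONS

print(len(contexts), upper_bound)
```

Generating the contexts rather than listing them by hand means that adding a dimension, or a value within one, extends the test set automatically.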

Developing the software support needed to be able to run tests of this
size regularly is clearly not a trivial task, but our opinion is that
being able to do so greatly contributes to maintaining the system's
reliability and integrity. We are thus giving high priority to
constructing the necessary tools in the current phase of the project.

Also worth noting is that the tests described above are exclusively at the
sentence level. For complete tests of the compositionality of transfer,
one would have to construct test schemes for at least the noun phrase
level, as well. The compositionality tables for NPs should account for the
interactions (in various positions) of different NP-modifiers. Thus, the
transfer contexts should be something like the ones suggested in Table 4
and the transfer types should include the ones given in Table 5. This will
be further studied in the next phase of the project.


Conclusions

We have described a straightforward way of measuring the compositionality
of transfer-based MT systems by the use of "compositionality tables". We
claim this to be a good method for the objective evaluation of one aspect
of MT systems, even though the tables given in this paper should be
further extended to capture more transfer contexts and types of transfer
rules, as well as NP-structures.


Table 4: NP transfer contexts

Transfer context                   Example
Plural                             car parks
                                   parkeringsplatser
Definite                           the car park
                                   parkeringsplatsen
Genitive                           car park's
                                   parkeringsplatsens
Pre-modified by Adjective          big car park
                                   stor parkeringsplats
Pre-modified by Genitive           his car park
                                   hans parkeringsplats
Post-modified by PP                car park here
                                   parkeringsplats här
Post-modified by Relative clause   car park which I use
                                   parkeringsplats som jag använder


Table 5: Complex NP transfer types

Transfer type                      Example
Adjective Noun to Noun             bad luck
                                   otur
Noun PP to Noun                    chairman of the board
                                   styrelseordförande
Noun Noun to Noun                  car park
                                   parkeringsplats
Past Participle to Adjective       The broken cup
                                   Den trasiga koppen
Adjective to Relative clause       The uninsurable car
                                   Bilen som inte kan försäkras
PP to Genitive                     The end of the story
                                   Sagans slut


1. Introduction

We have recently begun work in machine translation and felt that it would
probably make sense to start by surveying the literature on evaluation. As
we read more and more on evaluation, we found that the success of an
evaluation often depends very strongly on the selection of an appropriate
application. If the application is well-chosen, then it often becomes
fairly clear how the system should be evaluated. Moreover, the evaluation
is likely to make the system look good. Conversely, if the application is
not clearly identified (or worse, poorly-chosen), then it is often very
difficult to find a satisfying evaluation paradigm. We begin our
discussion with a brief review of some evaluation metrics that have been
tried in the past, and then move on to a discussion of how to pick a good
application.

Why work on machine translation now, and what kind of MT is most likely to
be commercially and theoretically profitable? Though the ALPAC report
concluded in the sixties that there should be more basic research in MT,
it stated clearly that this basic research could not be justified in terms
of short-term return on investment.² In particular, when compared with
human capabilities (still the ultimate test), MT systems of the time were
not deemed a success, and might never be.

This belief may help explain the resistance of many MT researchers to take
evaluation questions seriously. The EUROTRA project, for example,
consciously decided to delay evaluation discussions as long as possible:
"Exact procedures for evaluation will be decided by the programme's
management committee toward the end of each phase..." (Johnson et al.,
1985, p. 158). Others argue against any human-related evaluations as
follows:

¹ The first author's permanent address is AT&T Bell Laboratories, Murray
Hill, NJ.

² "The Committee recommends expenditures in two distinct areas. The first
is computational linguistics as a part of linguistics: studies of parsing,
generation, structure, semantics, statistics, and quantitative linguistic
matters, including experiments in translation, with machine aids or
without. Linguistics should be supported as science, and should not be
judged by any immediate or foreseeable contribution to practical
translation... The second area is improvement of [human] translation [with
respect to practical issues such as speed, cost, and quality]." (Pierce et
al., 1966, p. 34)

Performance of operational MT systems is usually measured in terms of
their cost per 1,000 words and their speed in pages per post-editor per
hour vs. the relative cost and speed of human translation.... In my
opinion, it is becoming increasingly uninformative to compare the
performance of MT systems with that of human translators, even though many
organizations tend to do that to justify their MT investments. (Tucker,
1987, p. 28)

We believe that these attitudes hurt the cause of MT in the long run. As
is proved by the increasing availability of commercial MT and MAT systems
(such as Systran, Fujitsu's Atlas, Logos, IBM's Shalt, and several others,
for less than $100,000), MT today is beginning to find areas of real
(commercial) applicability. Thus, to the questions "Has anything changed
since ALPAC? How can one build MT systems that make a difference?", we
answer that the community needs to find evaluation measures and
applications that highlight the value of MT research in those areas where
systems can be employed in a real (and economically measurable) way. Human
and machine translation show complementary strengths. In order to design
and build a theoretically and practically productive MAT system, one must
choose an application that exploits the strengths of the machine and does
not compete with the strengths of the human. This point is well put in the
following:

"The question now is not whether MT (or AI, for that matter) is feasible,
but in what domains it is most likely to be effective.... The object of an
evaluation is, of course, to determine whether a system permits an
adequate response to given needs and constraints." (Lehrberger and
Bourbeau, 1988, p.lW)

What then are appropriate evaluation measures? It would be nice if the
evaluations were to identify those (aspects of) MT systems that make them
suitable for, and could steer them towards, high-payoff niches of
functionality. But in spite of all the literature on MT evaluation, the
general evaluation measures that are proposed often fail to pinpoint the
strengths of systems and lead them toward real utility; instead, they seem
to confound important and less important aspects. Tucker's review of
Taum-Meteo and Metal, for example, might give one the mistaken impression
that both systems work about equally well (namely, approx. 80%):³

"Taum-Meteo has been operational since 1977, translating about five
million words annually at a rate of success of 80% without postediting."
(Tucker, 1987, p. 31)

"The Metal system is reported to have achieved between 45% and 85%
'correct' translations, using an experimental base of 1,000 pages of text
over the last five years." (Tucker, 1987, p. 32)

However, these numbers do not accurately reflect the crucial difference
between these two systems. Taum-Meteo is generally regarded as a fairly
complete solution to the domain-restricted task of translating weather
forecasts, whereas Metal is widely regarded as a less complete solution to
the more ambitious task of translating unrestricted text. The evaluation
measure ought to be able to highlight the strengths and weaknesses of a
system. Apparently, the "success rate" measure fails to meet this
requirement, presumably because it is too vague to be of much use.⁴

Unfortunately, this failure seems to be characteristic of many of the
task-independent evaluation metrics that have been proposed thus far.
Since, in our opinion, the blame is to be laid on the desire for
generality, we propose that MT evaluation metrics should be sensitive to
the intended use of the system. In this paper, we begin by outlining
metrics that have been proposed and end by concluding that it becomes
crucial to the success of an MT effort to identify a high-payoff niche
application so that the MT system will stand up well to the evaluation,
even though the system might produce crummy translations.

³ According to [name illegible] (personal communication), Meteo currently
achieves [figure illegible] on a volume of [figure illegible] million
words per year. The increased performance is largely due to improvements
in the communication system: communication noise used to be responsible
for a large percentage of the failures.

⁴ The success rate of 80% reported in (Isabelle, 1984, p. [illegible])
probably should not be compared with the numbers reported for Metal. In
addition to translating the input, Meteo also attempts to determine if the
translation should be checked by a professional translator. The figure
reported in (Isabelle, 1984) refers to the fraction of the input that
Meteo handles by itself without assistance from a professional translator.
The figures reported for Metal refer to an evaluation of the correctness
of the output.

2. Traditional Evaluation Metrics

2.1. System-based Metrics

We identify three major types of evaluation metrics: system-based,
text-based and cost-based. System-based metrics count internal data
resources such as the number of words in the lexicons, rules in the
grammars, semantic, grammatical, or lexical features, the number of
representation elements in the semantic ontology or interlingua (if any),
and the number of translation rules (if any). The literature contains many
examples of system-based metrics, for instance:

At the moment there are about sixty subgrammars for analysis and about 900
rewriting rules in total... number of rewriting rules for transfer and
generation processes is around 800, and it will be increased in the coming
few months. The dictionary contains about 16,000 items at present, and
will be increased to 100,000 items at the end of the project. (Nagao,
1987, p. 276)

An advantage of these metrics is that they are easy to measure, which
makes them popular. But since these metrics are tied to a particular
system, they cannot be used very effectively for comparing two systems.
They are much more effective for calibrating system growth over time. The
major disadvantage of these metrics is that they are not necessarily
related to utility.

2.2. Text-based Metrics

2.2.1. Sentence-Based Metrics

These metrics, the most common class, are applied to individual sentences
of target texts by counting, for example, the number of sentences
semantically and stylistically correct, the number of sentences
semantically correct, but with odd style, the number of sentences
partially semantically correct, the number of sentences semantically and
syntactically incorrect, and the number of sentences missed altogether. A
good example appears in (Nagao 1986), in which sentences are classed into
one of five categories of decreasing intelligibility and into one of six
categories of decreasing accuracy. Another example is the evaluations
developed to measure the results of Eurotra systems (see Johnson et al.,
1985).

Given the subjective nature of semantic, syntactic, and (especially)
stylistic "correctness", these metrics are impossible to make precise in
practice. In addition, their limitation to single sentences makes them too
simplistic (for example, it is not clear how to scale the metric when
several source sentences are combined in the target text, or when parts of
them are grouped into sentences differently).

2.2.2. Comprehensibility Metrics

These metrics seek to measure translation quality by testing the user's
comprehension of the target text as a whole. They include counting the
number of texts translated well enough for full comprehension, the number
of texts in which enough could be gleaned to get a reasonably good
understanding of the content, though details may be missing, the number of
texts in which some content could be gathered, enough to tell whether the
text is of interest to the user or not, the number of texts with fatal
inconsistencies or omissions, and the number of texts missed altogether.

These evaluation metrics enjoy some significant advantages. First, they
can be performed by the intended user of the translation, requiring little
or no source language expertise. Second, they take in stride the mis- or
even non-translation of text due to certain relatively isolated phenomena
which have proven very hard to handle in computational systems in a
general way (but which people can figure out themselves fairly easily). A
major disadvantage of these metrics is the difficulty of quantifying them.
One approach to overcome this difficulty is to create comprehension
questionnaires that measure (in SAT-test-like manner) how understandable
translations are to their intended users with respect to their intended
uses. An example, using a test suite of texts, is proposed in (King and
Falkedal, 1990). A second approach is to determine how willing users would
be to pay for professional translation of the text, given the translated
version. Since professional translation is expensive, the users will be
motivated to identify the more useful systems.

2.2.3.  Amount  of  Rat-EcSting 

Metrics in this subclass are based on the amount
of work required to turn the translated text into a form
indistinguishable from a human translator's effort. Ways
of quantifying this include counting the number of editing
keystrokes required per page, timing the revision
process per page, and counting the percentage of
machine-translated words in final text. An example is
the keystroke count reported as follows:

"As an alternate measure of the system's
performance, one of us corrected each of the
sentences in the last three categories
(different, wrong, and ungrammatical) to
either the exact or the alternate category.
Counting one stroke for each letter that
must be deleted and one stroke for each
letter that must be inserted, 778 strokes
were needed to repair all of the decoded
sentences. This compares with the 1,918
strokes required to generate all of the Hansard
translations from scratch." (Brown et
al., 1990, p. 84)

Some researchers object to keystroke counting because
they don't believe that the counts are correlated with
utility.
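
The keystroke measure quoted above can be made concrete. The
sketch below is one assumed reading of it, not the procedure Brown
et al. actually used: it charges one stroke per deleted character
and one per inserted character, with no substitutions, which works
out to the combined length of the two strings minus twice the
length of their longest common subsequence. The function name
repair_strokes is hypothetical, and unlike the quoted count it
treats spaces and punctuation as characters too.

```python
def repair_strokes(machine_output: str, target: str) -> int:
    """Strokes needed to turn machine_output into target using only
    single-character deletions and insertions (one stroke each)."""
    # prev[j] holds the longest-common-subsequence length between the
    # part of machine_output processed so far and target[:j].
    prev = [0] * (len(target) + 1)
    for a in machine_output:
        cur = [0]
        for j, b in enumerate(target, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    lcs = prev[-1]
    deletions = len(machine_output) - lcs   # characters to remove
    insertions = len(target) - lcs          # characters to add
    return deletions + insertions
```

Dividing such a count by the strokes needed to type the target from
scratch gives a relative effort figure. In the quoted experiment the
comparable ratio is 778 / 1,918, or roughly 40 percent of the typing
needed to produce the translations from scratch.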

2.3. Cost-Based Measures

The third major type of metric concentrates on the
system's efficiency in producing a translation, as in:

1. cost per page of acceptable translation (machine,
human, or mixed),

2. time per page of acceptable translation (machine,
human, or mixed).

One such evaluation was done on Taum-Aviation (Isabelle
and Bourbeau, 1985):


Task                            Machine    Human

Preparation / input             $0.014     $0.000
Translation                     $0.079     $0.100
Human revision                  $0.068     $0.030
Transcription / proofreading    $0.022     $0.015

Total (Can. $ per page)         $0.183     $0.145

The problem with cost-based metrics is that they often
don't make the systems look very good. As can be
noted from the table above, the evaluation shows that
Taum-Aviation is actually more expensive than human
translation (HT). If one wants the system to look good,
it is important to pick a good niche application.
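
The comparison in the Taum-Aviation table is simple column
arithmetic, and a short sketch makes the size of the gap explicit.
The per-task figures below are the ones read from the table and
should be treated as assumptions if another copy of the table
differs; the names COSTS and column_totals are hypothetical.

```python
from decimal import Decimal

# (machine, human) cost per task in Canadian dollars, as read from the table.
COSTS = {
    "Preparation / input": (Decimal("0.014"), Decimal("0.000")),
    "Translation": (Decimal("0.079"), Decimal("0.100")),
    "Human revision": (Decimal("0.068"), Decimal("0.030")),
    "Transcription / proofreading": (Decimal("0.022"), Decimal("0.015")),
}


def column_totals(costs):
    """Sum the machine column and the human column separately."""
    machine = sum(m for m, _ in costs.values())
    human = sum(h for _, h in costs.values())
    return machine, human


machine_total, human_total = column_totals(COSTS)
# Extra cost of the machine chain relative to human translation.
overhead = (machine_total - human_total) / human_total
```

With these figures the machine chain totals $0.183 against $0.145
for human translation, about 26 percent more. The machine saves on
the translation step itself but loses more than that saving on
preparation and human revision.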

3. Characteristics of a Good Niche

We believe a good niche application should meet
as many of the following desiderata as possible:

(a) it should set reasonable expectations,

(b) it should make sense economically,

(c) it should be attractive to the intended users,

(d) it should exploit the strengths of the machine and
not compete with the strengths of the human,

(e) it should be clear to the users what the system can
and cannot do, and

(f) it should encourage the field to move forward
toward a sensible long-term goal.[5]

[5] Many long-term goals have been proposed over the years.
FAHQT (fully-automatic high-quality translation) (Bar-Hillel,
1960, p. 94) is perhaps one of the most well-known proposals.


4. Extensive Post-Editing (EPE):

An Inappropriate Niche

It is not easy to identify a good niche application.
One cannot simply take a state-of-the-art MT program
and give it to a bunch of salesmen and expect a miracle.
One has to find an application that makes sense.

The extensive post-editing (EPE) application
would appear to be a natural way to get value out of a
state-of-the-art MT system. But unfortunately, the
application fails to meet most of the desiderata proposed
above.

4.1. (a) Realistic Expectations

One can find numerous testimonials in the literature
that sound too good to be true (and probably are):

"Although you can expect to at least double
your translator's output, the real cost-saving
in MT lies in complete electronic transfer of
information and the integration into a fully
electronic publishing system." (Magnusson-
Murray, 1985, p. 180)

"Substantial rises in translations output, by
as much as 75 per cent in one case, are
being reported by users of the Logos
machine translation (MT) system after only
a few months." (Lawson, 1984, p. 6)

"For one type of text (data description
manuals), we observed an increase in
throughput of 30 per cent." (Tschira, 1985)

Statements such as these run the risk of setting
unrealistic expectations, and consequently, in the long
run, it is possible that they could actually do more harm
than good. (We discuss the dangers of unrealistic
expectations in section 7.) If users could really expect
even modest gains in productivity, then one would
expect that EPE products offered by ALPS, Logos, Systran,
Weidner and others would stand on their merits in
the marketplace, and would not need all the hype.

4.2. (b) Cost Effectiveness

In fact, careful trials appear to indicate that EPE
is actually more expensive than human translation (HT).
Van Slype (1979) estimated that EPE costs 475 Bfrs.
per 100 words, almost twice as much as HT (150-250
Bfrs. per 100 words). The Canadian government found
more or less the same result in their trial of the Weidner
product:

"[T]he HT production chain was
significantly faster than the MT production
chain. How much faster depends on which
phases of the MT chain are counted. If we
count all the steps on the log form, human
translation was nearly twice as fast as
machine translation. If we discount the time
that the machine actually takes to translate
(on the assumption that the participants
could use this time to do other useful tasks),
as well as the time for the second dictionary
update (on the grounds that these new or
modified entries are not intended for the
current text), MT remains 27% slower than
HT. If, in addition, we discount the time
for text entry, assuming that source texts
arrive in machine readable form that
Weidner could import, MT still remains 5%
slower than HT for all the texts translated
during the operational phase of the trial."
(Macklovitch, 1991, p. 3)

Thus, there are serious indications that it may not
be commercially viable to use professional translators as
post-editors. In fact, there have been questions about
the cost effectiveness of the EPE application dating back
to the ALPAC report, well before many of these products
were introduced into the marketplace:[6]

"The postedited translation took slightly
longer to do and was more expensive than
conventional human translation... Dr. J. C.
R. Licklider of IBM and Dr. Paul Garvin of
Bunker-Ramo said they would not advise
their companies to establish such a service."
(Pierce et al., 1966, p. 19)

4.3. (c) Attractiveness to Intended Users

In addition, EPE has failed to gain much acceptance
among the intended target audience of professional
translators, because postediting turns out to be an
extremely boring and tedious chore.[7]

"Most of the translators found postediting
tedious and even frustrating. In particular,
they complained of the contorted syntax
produced by the machine. Other complaints
concerned the excessive number of lexical
alternatives provided and the amount of
time required to make purely mechanical
revisions." (Pierce et al., 1966, p. 90)

"Many, but not all, translators decided, after
the first phase of the MT experiment, that
Systran was not a translation aid, because
they found that it took too long, and was
too tedious, to convert raw MT into a translation
to which they would be prepared to
put their name." (Wagner, 1985, p. 203)

"When asked by the consultant if they
would like to continue working with
Weidner on the same texts after the end of
the trial, not a single participant accepted."
(Macklovitch, 1991, p. 4)

[6] The full discussion of the EPE application is described in
more detail in Appendix 14 of the ALPAC report. The translators
observed that postediting tends to "impede the rapid translators
and assist the slow translators" (Pierce et al., 1966, p. 44). This
would suggest that EPE products might be more attractive for
casual use by an amateur rather than use by a professional.

[7] Perhaps the task would be less tedious if the interface
were made more flexible and more user-friendly.

After reading Macklovitch's description of some
of the errors in (Macklovitch, 1986), one can easily
appreciate why some of the translators would be frustrated
with the post-editing task. Macklovitch observed
that approximately half of the errors in one sample
involved the overuse of French articles. In translating
an English noun phrase into French, it is a pretty good
bet that the French noun phrase should begin with an
article even if there isn't one in English. However, this
rule does not hold in tables, where the French use of
articles is apparently somewhat more like English. As it
happened, one of the texts used in the trial contained a
very long list of crop varieties published by Agriculture
Canada, most of which should not have been translated
with an article. Unfortunately, the Weidner system did
not know that noun phrases work differently in tables,
and consequently, the post-editor was faced with the
rather tedious task of deleting the article and adjusting
the capitalization for each of the crop varieties in this
very long list. The professional translator probably
would have found it quicker and more rewarding to
translate the list from scratch.

4.4. Kay's Characterization of EPE

One can continue to go through the list of
desiderata proposed above and find even more reasons
why EPE is an inappropriate niche. Rather than beat a
dead horse ourselves, we thought we would let Martin
Kay do it for us, as only he can:

"There was a long period - for all I know,
it is not yet over - in which the following
comedy was acted out nightly in the bowels
of an American government office with the
aim of rendering foreign texts into English.
Passages of innocent prose on which it was
desired to effect this delicate and complex
operation were subjected to a process of
vivisection at the hands of an
uncomprehending electronic monster that
transformed them into stammering streams
of verbal wreckage. These were then placed
into only slightly more gentle hands for
repair. But the damage had been done.
Simple tools that would have done so much
to make the repair work easier and more
effective were not to be had presumably
because of the voracious appetite of the
monster, which left no resources for anything
else. In fact, such remedies as could
be brought to the tortured remains of these
texts were administered with colored pencils
on paper and the final copy was produced
by the action of human fingers on the keys
of a typewriter. In short, one step was singled
out of a fairly long and complex process
at which to perpetrate automation. The step
chosen was by far the least well understood
and quite obviously the least apt for this
kind of treatment." (Kay, 1980, "The
Proper Place of Men and Machines in
Language Translation," p. 2)

5. A Constructive Suggestion:

The Workstation Approach

Having established that EPE is inappropriate, Kay
then suggested a workstation approach. At first, the
workstation might do little more than provide word-
processing functionality, dictionary access and so on,
but as time goes on, one might imagine functionality
that begins to look more and more like machine
translation.

"I come now to my proposal. I want to
advocate an incremental approach to the
problem of how machines should be used in
language translation. The word approach
can be taken in its original meaning as well
as the one that has become so popular in
modern technical jargon. I want to advocate
a view of the problem in which machines
are gradually, almost imperceptibly, allowed
to take over certain functions in the overall
translation process. First they will take over
functions not essentially related to translation.
Then, little by little, they will
approach translation itself. The keynote will
be modesty. At each stage, we will do only
what we know we can do reliably. Little
steps for little feet!" (Kay, 1980, p. 11)

In his concluding remarks, Kay expressed the
hope that his approach be implemented by someone
with enough "taste" to be realistic and pragmatic.

"The translator's amanuensis [workstation]
will not run before it can walk. It will be
called on only for that for which its masters
have learned to trust it. It will not require
constant infusions of new ad hoc devices
that only expensive vendors can supply. It
is a framework that will gracefully accommodate
the future contributions that linguistics
and computer science are able to make.
One day it will be built because its very
modesty assures its success. It is to be
hoped that it will be built with taste by people
who understand languages and computers
well enough to know how little it is that
they know." (Kay, 1980, p. 30)

In fact, Kay's approach has recently been implemented
by people who understand the practical realities
well enough to take an even more modest approach than
Kay himself probably would have taken. CWARC
(Canadian Workplace Automation Research Center) has
undertaken to provide the Canadian government's Translation
Bureau with a translator's workstation that could
be deployed in the near-term to the bureau's 900 full-time
translators (Macklovitch, 1989). For obvious pragmatic
considerations, they have decided to use the following
off-the-shelf components:

(a) a PC/AT,

(b) network access to the Termium terminology database
on CD-ROM,

(c) WordPerfect, a text editor,

(d) CompareRite, a program for comparing two versions
of a text file,

(e) TextSearch, a program for making concordances
and counting word frequencies,

(f) Mercury/Termex, a program for maintaining a
private terminology database,

(g) Procomm, a program providing remote access to
data banks via a telephone modem,

(h) Seconde Memoire, a program that deals with
French verb conjugations, and

(i) Software Bridge, a program for converting word
processing files from one commercial format into
another.

This is clearly an ideal starting point for introducing
technology into the translator's workplace. They
will hopefully be able to demonstrate that the PC-based
workstation is clearly superior to dictation machines.
After they have achieved a track record of success and
the new technology has been in place for a while, they
will be in a much better position to introduce additional
tools, which might be more exciting to us, but also more
risky for the managers at the translation bureau.


One might imagine all kinds of exciting tools.
For example, the workstation could have a "complete"
key, like control-space in Emacs, which would fill in the
rest of a partially typed word/phrase from context. One
might take this idea a step further and imagine that it
ought to be able to build a super-fast typewriter that
would be able to correct typos and fill in context given
relatively few keystrokes. Peter Brown (personal
communication) once remarked that such a super-fast
typewriter ought to be possible in the monolingual case,
observing that there is so much redundancy in language
that the user should only have to type a few characters
per word, or about the equivalent of 1.35 bits per
character (Shannon, 1951),[8] which is only slightly more than
a byte (ascii character) per English word on average.
The user should have to type even less in the bilingual
case because the source language should provide quite a
number of additional bits of information.
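
A toy version of the "complete" key shows how a few typed
characters can stand in for a whole word. This is only a sketch
under strong assumptions: it uses plain word frequencies rather
than context, and the names build_counts, complete and
keystrokes_needed are hypothetical, not part of any system
described here.

```python
from collections import Counter
from typing import Optional


def build_counts(text: str) -> Counter:
    """Word frequencies from previously typed (or translated) text."""
    return Counter(text.lower().split())


def complete(prefix: str, counts: Counter) -> Optional[str]:
    """Most frequent known word starting with prefix, or None."""
    candidates = [w for w in counts if w.startswith(prefix)]
    if not candidates:
        return None
    # Highest count wins; ties are broken alphabetically so that the
    # result is deterministic.
    return min(candidates, key=lambda w: (-counts[w], w))


def keystrokes_needed(word: str, counts: Counter) -> int:
    """Letters the user must type before the complete key yields word."""
    for n in range(1, len(word) + 1):
        if complete(word[:n], counts) == word:
            return n
    return len(word)  # unknown word: it has to be typed in full
```

Frequent words need very few letters, while rare words that share a
long prefix with a neighbour need almost all of them. A bilingual
version could rank candidates using the source sentence as well,
which is the intuition behind the claim that the source language
supplies additional bits.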

The super-fast typewriter may still be a ways off,
but we are almost already in a position to provide some
very useful but less ambitious facilities. In particular,
the Translation Bureau currently spends a lot of
resources retranslating minor revisions of previously
translated materials (e.g., annual reports that generally
don't change much year after year). It would be very
useful if there were some standard tools for archiving
and retrieving previously translated texts so that the
translators would have access to the previous translations,
when appropriate. It is also becoming possible to
use bilingual concordances to help with terminological
issues.
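
Retrieval of previously translated text can be sketched with
nothing more than a string-similarity measure. The example below is
an illustration under assumptions, not a description of any tool at
the Translation Bureau: it uses the SequenceMatcher class from
Python's standard difflib module, the function name
find_previous_translation and the 0.8 threshold are hypothetical,
and the archive is assumed to be a list of (source sentence,
translation) pairs.

```python
from difflib import SequenceMatcher


def find_previous_translation(sentence, archive, threshold=0.8):
    """Return (source, translation, score) for the archived source
    sentence most similar to `sentence`, or None if nothing in the
    archive reaches the similarity threshold."""
    best = None
    best_score = threshold
    for source, translation in archive:
        # ratio() is 1.0 for identical strings and 0.0 when the two
        # strings have no characters in common.
        score = SequenceMatcher(None, sentence, source).ratio()
        if score >= best_score:
            best = (source, translation, score)
            best_score = score
    return best
```

For a minor revision, such as an annual report in which only the
year has changed, the archived sentence scores close to 1.0 and its
translation can be offered to the translator as a starting point. A
sentence with no near match returns None and is translated in the
usual way.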

The workstation application stands up to the six
desiderata proposed above much better than the EPE
application. It is (a) much more realistic, so it should
have a better chance of (b) economic success. After all,
it ought to be able to beat dictation machines, at least in
many cases. In addition, it has a better chance of (c)
being attractive to the intended users and (d) exploiting
the strengths of the machine as well as those of the
human since it is being developed and tested by professional
translators at the request of a translation organization.
Since it is so modest, it should be (e) fairly clear
what it can and cannot do. Finally, there is a (f) clear
path toward a desirable long-term goal, since the
strategy explicitly calls for more and more ambitious
tools as time goes on.


[8] Shannon's estimate that English has an entropy of 1.35 bits
per character is optimistic. In practice, one would
probably expect a realistic system to have an entropy somewhat
closer to 1.75 bits per character (Brown et al., 1991).

6. Another Constructive Suggestion:

Appeal to the End-User

The workstation approach is a direct appeal to the
professional translators; it uses the benefits of office-
automation as a way to sneak technology into the
translator's workplace. An alternative approach, which
also seems promising to us, is to use the speed advantages
of raw (or almost raw) MT to appeal to the end-user
who may not require high quality.

6.1. Rapid Post-Editing

After noting that translators were unlikely to support
the EPE application because they are unlikely to
choose MT over HT, Wagner found that end-users
would often opt for crummy quick-and-dirty translation,
if they were given a choice.

"We therefore decided to use Systran in a
different way - to provide a faster translation
service for those translation users who
wanted it, and were willing to accept
lower-quality translation." (Wagner, 1985,
p. 203)

The output from Systran was passed through a 'rapid
post-editing' service that emphasized speed (4-5 pages
per hour) over quality. When the project was first
presented to the translation staff, it was well-received
and 13 out of 35 volunteered to offer the rapid post-
editing service on the understanding that they could opt
out if they did not enjoy it. Wagner found that "the
option is popular with a number of users and perhaps
surprisingly, welcomed with some enthusiasm by CEC
(Commission of the European Communities) translators
who find rapid post-editing an interesting challenge"
(Hutchins, 1986, p. 261).

Wagner's rapid post-editing service is a much
better application of crummy MT than EPE because it
gives all parties a choice. Both the users and the
translators are more likely to accept the new technology,
warts and all, if they are given the choice to go back
and do things the old-fashioned way. The trick to being
able to capitalize on the speed of raw MT is to persuade
both the translators and the end-users to accept lower
quality. Apparently, the end-users are more easily
convinced than the translators, and therefore, for this
approach to fly, it is important that the end-users be in
the position to choose between speed and quality.

6.2. No Post-Editing

The Georgetown system was used extensively at
the EURATOM Research Center in Ispra, Italy, and the
Atomic Energy Commission's Oak Ridge National
Laboratory from 1963 until 1973. Translations were
delivered without pre-editing or post-editing. In 1972-1973,
Bozena Henisz-Dostert (now Bozena Thompson)
conducted an evaluation and concluded that users were
quite happy with raw MT.

"The users presented a rather satisfied
group of customers, since 96 percent of
them had or would recommend machine-
translation services to their colleagues, even
though the texts were said to require almost
twice as much time to read as original
English texts (humanly-translated texts also
were judged to take longer to read, but only
about a third longer), and that machine-
translated texts were said to be 21 percent
unintelligible. In spite of slower service
than desired and a high demand on reading
time, machine translation was preferred to
human translation by 87 percent of the
respondents if the latter took three times as
long as the former. The reasons for the
preference were not only earlier access, but
also the feelings that the 'machine is more
honest', and that since human labor is not
invested it is easy to discard a text which
proves of marginal interest. Getting used to
reading machine-translation style did not
present a problem as evidenced by the
answers of over 95 percent of the respondents."
(Henisz-Dostert, 1979, p. 206)

It is also interesting to compare the attitudes of
the users of this service with the attitudes of the translators
mentioned above. Henisz-Dostert found that end-users
were generally quite supportive, and would recommend
the service to a friend, whereas Macklovitch
found that professional translators were generally unwilling
to continue using the service themselves, let alone
recommend the service to a friend.

"A grateful word is in order on the users'
attitudes, who were most cooperative and
friendly, and interested in what was
involved in machine translation. They
showed their familiarity with the aberrations
of the texts, some of which were considered
quite amusing 'classics', e.g., 'waterfalls'
instead of 'cascades' (the users asked that
this not be changed!). Very commonly, and
understandably, they were interested in
improvements and offered many suggestions.
An example of an extreme attitude
on the part of one user in this respect was
that of 'cheating' on the questionnaire by
giving less positive answers than in oral
discussions. When subsequently asked about
this, he reacted with something like: 'I use
it so much, I want you to improve it, and if
I show that I am satisfied, you will not work
on it any more.'" (Henisz-Dostert, 1979, p.
151)

Why are these users so much more satisfied with
MT than the translators involved in the Canadian
government's trial of Weidner? We believe the
difference is the application. It makes sense to offer
end-users the option to trade off speed for quality,
whereas it does not make sense to try to force translators
to become post-editors. Consider the example of
the crop varieties mentioned above. Many end-users
might not be bothered too much by the extra articles
because they can quickly skim past the mistakes, but the
professional translator might feel quite differently about
the extra articles because he or she will have to fix
them.

6.3. More Modest Attempts to Appeal to the End-User

Consider, for example, the problem of reading
email from other countries. The first author currently
receives several messages a day in French such as the
following:[9]

Pour repondre aux questions de Maurizio
LANA, j'ai entendu dire de bonnes choses
concernant le programme ALPS de Alan
MELBY. C'est au moins le nom de sa
societe (ALPS) qui se trouve a Provo ou a
Orem (Utah, USA). Il est egalement professeur
de linguistique a la Brigham Young
University (Provo, Utah).

It might be possible to provide a tool to help recipients
whose French is not very good. Imagine that the email
reader had a "cliff-note" mode that would gloss many
of the content words with an English equivalent:

Pour repondre aux questions de Maurizio LANA,
     answer       questions

j'ai entendu dire de bonnes choses concernant
     heard   say     good   things concerning

Cliff-note mode could be used as a way to sneak
technology into the email reader, just as Kay's workstation
approach is a way of sneaking technology into the
translator's workplace. At first, cliff-note mode would
do little more than table lookup, but as time goes on, it
might begin to look more and more like machine translation.
In the future, for example, the system might be
able to gloss the phrase le nom de sa societe as the
name of his company, but currently the system would
gloss nom as behalf and societe as society, because these
translations are more common in the Canadian Hansards
(parliamentary debates), which were used to train the
system.

[9] These messages usually arrive without accents.
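
The table-lookup stage of cliff-note mode is small enough to sketch
directly. The code below is a hypothetical illustration, not the
authors' system: the GLOSSARY entries are hand-written stand-ins for
a table that would in practice be derived from bilingual text, and
the function name gloss is an assumption. Words missing from the
table, including function words and names, are simply left
unglossed.

```python
import string

# Toy French-to-English table; a real one would be learned from data.
GLOSSARY = {
    "repondre": "answer",
    "questions": "questions",
    "entendu": "heard",
    "dire": "say",
    "bonnes": "good",
    "choses": "things",
    "concernant": "concerning",
    "nom": "behalf",
    "societe": "society",
}


def gloss(line: str, glossary: dict) -> tuple:
    """Return the source line and an aligned line of English glosses."""
    top, bottom = [], []
    for token in line.split():
        # Look the word up without surrounding punctuation or case.
        key = token.strip(string.punctuation).lower()
        english = glossary.get(key, "")
        # Pad both rows to the same width so glosses sit under words.
        width = max(len(token), len(english))
        top.append(token.ljust(width))
        bottom.append(english.ljust(width))
    return " ".join(top).rstrip(), " ".join(bottom).rstrip()
```

Printing the two returned strings one above the other gives the
interlinear layout shown earlier. Because each word is looked up in
isolation, the phrase le nom de sa societe comes out as behalf and
society, which is exactly the limitation described above.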

Cliff-note mode stands up fairly well to the six
desiderata. (a) It sets reasonable expectations. (b) It
doesn't cost much to run. (c) It ought to be attractive to
users. After all, those who don't like it, don't have to
use it. (d) It is well-positioned to integrate the strengths
of the machine (vocabulary) without competing with the
strengths of the user (knowledge of function words, syntax
and domain constraints). (e) It is so simple that users
shouldn't have any trouble appreciating both the
strengths as well as the weaknesses of the word-for-
word approach. Finally, (f) the strategy of gradually
introducing more and more technology is ideally suited
for advancing the field toward desirable long-term goals.

7. Conclusion

We have identified six desiderata for a good niche
application. Two marketing strategies appear to meet
these six desiderata fairly well:

(1) use the benefits of office-automation to sell to the
professional translator, or

(2) use the speed advantages of raw (or almost raw)
MT to sell to the end-user who may not require
high quality.[10]

The discussion has stressed pragmatism
throughout. The speech processing community, for
example, has been somewhat more successful recently in
making it possible to report crummy results. It is now
quite acceptable in the speech community to work on
very restricted domains (e.g., spoken digits, resource
management (RM), airline traffic information system
(ATIS)) and to report performance that doesn't compare
with what people can do. No one would even suggest
that a machine should be able to recognize digits as well
as a person could. Because the field has taken a more
realistic approach, the field now has a fairly good public
image, and is appearing to be making progress at a
reasonable rate:

"Slowly but surely, the technology is making
its way into the real world." (Schwartz,
1991, Business Week, p. 130)


But there was a time when speech researchers
were much more ambitious. According to Klatt's
review (Klatt, 1977), the first ARPA Speech Understanding
project (Newell et al., 1973) had the objective
of obtaining a breakthrough in speech understanding
capability that could then be used toward the development
of practical man-machine communication systems.
Even though Harpy (Lowerre and Reddy, 1980) did in
fact exceed the specific goals of the project (e.g., accept
a thousand word-vocabulary connected-speech with an
artificial syntax and semantics and produce less than
10% semantic error in a few times real time on a 100
mips machine), it didn't matter because Harpy had
failed to obtain the anticipated breakthrough. And
consequently, funding in speech recognition and understanding
was dramatically reduced over the following decade.
When activity was eventually resumed many years later,
the community had learned that it is ok to strive toward
realistic goals, and that it can be dangerous to talk about
breakthroughs.

[10] Other possibilities have also been suggested in the
literature. Xerox, for example, has achieved impressive results by
introducing a restricted language into the document preparation
process (see (Hutchins, 1986, p. 3[illegible])). Smart Systems has
also advocated the use of a restricted language [illegible].
Limiting the domain is another formula for success. The
classic example is Meteo (Isabelle, 1984). Unfortunately, however,
it is very hard to find very many other naturally-occurring
limited domains that people care about, and consequently, the
success is unlikely to be repeated very many times in the future.

7.1. The GU Experiment

The experience in machine translation is perhaps
even more sobering. The 1954 Georgetown University
(GU) experiment was a classic example of a success
catastrophe. In Zarechnak's 1979 review of early work
on machine translation, he recalled that the GU experiment
was originally seen as a huge advance:

"The result of GU machine translation was
given wide publicity in 1954 when it was
announced in New York. The announcement
was greeted by astonishment and skepticism
among some people. L. E. Dostert
summarized the result of the experiment as
being an authentic machine translation
which does not require pre-editing of the
input nor post-editing of the output."
(Zarechnak, 1979, p. 58)

But now, we can look back and see that the 1954
GU experiment probably did more harm than good by
setting expectations at such an unrealistic level that they
could probably never be met. Ten years after the GU
experiment, the ALPAC report compared four then-
current systems with the earlier GU experiment and
suggested that there had not been much progress.

"The reader will find it instructive to compare
the samples above with the results
obtained on simple, or selected, text 10
years earlier (the Georgetown-IBM Experiment,
January 7, 1954) in that the earlier
samples are more readable than the later
ones." (Pierce et al., 1966)


Zarechnak, a member of the Georgetown effort,
complained rather bitterly that the comparison was
unfair. In reality, the 1954 GU experiment had been a
canned demo of the worst kind, whereas the four systems
developed during the 1960s were intended to handle
large quantities of previously unseen text.

"When ten years later a text of one hundred
thousand words was translated on a computer
without being previously examined,
one would expect a certain number of errors
on all levels of operations, and the need for
post-editing. The small text in 1954 has no
such random data to translate." (Zarechnak,
1979, p. 56)

In fact, the ALPAC committee had also appreciated
the "toy"-ish aspects of the 1954 GU experiment,
but they did not feel that that was an adequate excuse.
They criticised both the 1954 experiment and the
four systems in question, the former for setting expectations
unrealistically high, and the latter for failing to
meet those expectations, unrealistic as they may be.

"The development of the electronic digital
computer quickly suggested that machine
translation might be possible. The idea captured
the imagination of scholars and
administrators. The practical goal was simple:
to go from machine-readable foreign
technical text to useful English text, accurate,
readable, and ultimately indistinguishable
from text written by an American
scientist. Early machine translations of simple
or selected text, such as those given
above, were as deceptively encouraging as
'machine translations' of general scientific
text have been uniformly discouraging."
(Pierce et al., 1966, pp. 23-24)

If expectations had been properly managed and
the waters had not been poisoned by the 1954 GU
experiment, it is possible that we would now look back
on the MT effort during the 1960s from a much more
positive perspective. In fact, one of the four systems in
question later became known as Systran, and is still in
wide use today. In this sense, early work on MT was
much more successful than early work on Speech
Understanding; the first ARPA Speech Understanding
Project did not produce any systems with the same
longevity as Systran.

For some reason that is difficult to understand, the
two fields currently have entirely different public
images; on the one hand, the layman can readily recognize
that it is extremely difficult for a machine to recognize
speech, while, on the other hand, even the manager
of a translation service will blindly accept the most
preposterous pretensions of practically any MT salesman.
Perhaps we can change this perception if we
succeed in focusing our attention on good applications
of state-of-the-art (i.e., crummy) machine translation.

1. Introduction

There are three reasons to perform evaluation of natural language processing software: to judge
progress in the field as a whole, to judge the success of a particular theory of language processing,
and to judge the appropriateness of the software for a particular application. In this presentation, I
will discuss the role of the Natural Language Software Registry in evaluation efforts aimed at
progress in the field as a whole. Particular software and theories may be considered, but with an eye
toward establishing a base of quality NLP software for research purposes. Thus, emphasis will be
laid on properties that are common to software regardless of the level or levels of linguistic analysis
being performed. Research, engineering, and logistical issues affecting software reusability emerge.

The Natural Language Software Registry was recently established at the University of Chicago's
Center for Information and Language Studies. Its purpose is to facilitate the exchange and
evaluation of noncommercial and commercial software. The Registry is sponsored by the Association
for Computational Linguistics. Projected Registry activities, pending funding, include

•  solicitation  and  collocation  of  software  descriptions 

•  distribution  of  descriptions  using  print  and  electronic  media 

•  establishment of a distribution mechanism for software not otherwise accessible

•  coordination of detailed reviews of noteworthy software, to be published
in Computers and the Humanities, Computational Linguistics, and other journals

•  participation  in  ongoing  software  evaluation  efforts 

The initial task has been the solicitation of reports from software developers, both academic and
commercial, with the aim of constructing a concise, uniform summary of software sources and
capabilities. Such a summary [Hinkelman 1991b] serves not only as a guide for researchers in
determining where to direct their software development efforts, but also as an index of the
state of the natural language processing endeavor. A pilot survey of 33 software items (as of March
'91) has enabled us to better understand evaluation criteria that can be applied at this gross level,
and supports a preliminary assessment of the state of natural language processing. This gross-
granularity approach to evaluation is complemented by extensive testing and review of selected
software. This paper emphasises metrics that reflect on the reusability of software, for purposes
beyond the immediate goal of the designer. It describes the pilot survey, its findings, and how a
particular software review reflects upon it.
2. Solicitation of Software Descriptions

The Registry project has been extensively publicized, and software descriptions solicited, at ACL'90,
COLING'90, and ACH/ALLC'91 [Hinkelman 1991a], and in the Finite String Newsletter and several
electronic bulletin boards. Respondents conform to what may be termed an "extroversion factor":
they represent commercial ventures, well-established community-minded research projects, and
active electronic bulletin board participants. Commercial enterprises tend to omit fields reflecting
system internals and potentially negative system features from their software descriptions. Some
established research projects have mandates and mechanisms for distributing their software; the
best examples of this are national projects such as Alvey. In the US researchers are not specifically
supported in beta-testing and distribution of software, but projects such as Penman, Sneps, and Rhet
have a record of doing so. They tend to provide very detailed descriptions. Individual electronic
bulletin board participants vary greatly in the modesty and number of hidden assumptions of
software descriptions. We hope to achieve more participation from major projects in the future, as
the notion of reusability gains widespread acceptance.

Software registered included processors for speech, morphology, syntax, and knowledge
representation; several large multicomponent projects; and applications software in the areas of
spelling checkers, database interfaces, computer aided education (poetry), and miscellaneous tools
for linguists. (Table 1.)

Survey questions were designed to reveal the capabilities and limitations of NLP software, along
with the conditions under which it can be acquired. They presumed that software being registered
belonged to core areas of natural language processing, and therefore contained some biases less
appropriate to software ultimately classed as application of NLP techniques. The survey information
breaks down into

•  basic administrative parameters such as developers' addresses

•  conditions on availability, such as fees, licences, and support provided

•  description of system goals and underlying principles

•  basic technical parameters, such as language of implementation

•  design features and test set size


Table 1. Major Software Categories

2  speech  signal  processing  systems 

3  morphological  analysis  programs 

5  syntactic  parsers 

1  knowledge  representation  system 

10  multicomponent  systems 

12  applications:  interlinear  text,  dialect  analysis,  spellchecking,  etc. 
It is the final category of survey information (design features and test set size) that has the most
relevance in evaluation of software quality. It encompasses

•  separation  of  code  from  natural  language  data 

•  embeddedness  vs.  independence  of  major  modules 

•  extensibility  of  the  system 

•  technical  and  theoretical  limits  to  the  range  of  natural  languages  accommodated 

•  number,  size,  and  nature  of  test  sets 


For non-numerical attributes, the gross-grain approach to metrics is triage.
Items are divided into three categories: yes, no, and an intermediate value. The interpretation of
the  intermediate  value  varies  according  to  the  parameter  being  evaluated.  It  may  indicate 
uncertainty  about  a  binary  parameter,  such  as  the  availability  of  source  code,  or  an  intermediate 
value  on  a  spectrum,  such  as  degree  of  independence  of  modules. 
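
The triage scheme can be sketched in a few lines of Python. This is only an illustrative sketch, not the Registry's own tooling: the names `Triage` and `tally`, and the sample answers, are hypothetical.

```python
from collections import Counter
from enum import Enum


class Triage(Enum):
    """Three-way scale for a non-numerical attribute (names are hypothetical)."""
    YES = "yes"
    MAYBE = "maybe"  # uncertainty about a binary value, or a midpoint on a spectrum
    NO = "no"


def tally(responses):
    """Count how many registered items fall into each triage category."""
    counts = Counter(responses)
    return {category: counts.get(category, 0) for category in Triage}


# Made-up answers for one attribute, e.g. availability of source code.
source_code = [Triage.YES, Triage.NO, Triage.MAYBE, Triage.YES, Triage.NO]
summary = tally(source_code)
print({category.value: n for category, n in summary.items()})
```

Keeping a single intermediate value, rather than separate "unknown" and "partial" values, mirrors the text: its interpretation is left to depend on the parameter being evaluated.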

The  gross-grain  metric  used  for  test  set  size  is  powers-of-ten,  made  comparable  across  systems  by 
assuming  that  there  are  approximately  ten  words  per  sentence,  and  at  least  ten  sentences  per 
message.  Provisionally,  a  "concept"  in  knowledge  representation  is  assumed  to  contain 
approximately  as  much  structure  as  a  sentence,  and  a  knowledge  representation  problem  about  the 
complexity  of  a  message.  This  metric  makes  it  possible  to  compare  systems  on  paper. 
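
As a rough illustration of the powers-of-ten metric, the sketch below converts a reported test set size to an approximate word count using the conversion factors stated above, then keeps only the order of magnitude. Expressing everything in words is an assumption made here for the example, and the names `WORDS_PER_UNIT` and `order_of_magnitude` are hypothetical.

```python
# Conversion factors from the text: about ten words per sentence and (at
# least) ten sentences per message. A knowledge representation "concept" is
# provisionally equated with a sentence, and a "problem" with a message.
# Using words as the common unit is an assumption of this sketch.
WORDS_PER_UNIT = {
    "word": 1,
    "sentence": 10,
    "concept": 10,
    "message": 100,
    "problem": 100,
}


def order_of_magnitude(count, unit):
    """Return the power of ten of a test set size, expressed in words."""
    if count <= 0:
        raise ValueError("count must be a positive integer")
    words = int(count) * WORDS_PER_UNIT[unit]
    # Number of decimal digits minus one is floor(log10(words)) for integers.
    return len(str(words)) - 1


# Three differently reported test sets land in the same bucket (10^3 words).
print(order_of_magnitude(500, "sentence"))  # 5000 words
print(order_of_magnitude(50, "message"))    # 5000 words
print(order_of_magnitude(8000, "word"))     # 8000 words
```

The digit-count trick avoids floating-point logarithms, which is sufficient because only whole powers of ten are compared.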

3. Quality of Software Registered

The  survey  results  using  these  metrics  are  given  below. 

3.1. Modularity

Researchers  are  often  interested  in  linking  experimental  modules  with  other  NLP  components, 
especially  with  modules  that  generate  the  input  desired  for  the  experimental  module.  This  leads  to 
the  "cut  and  paste"  criterion  for  modularity,  which  asks  whether  major  processing  components  can 
be extracted and used in combination with different components. (Data is another issue.) Extraction
and recombination was not possible for over half of the registered descriptions.

Some of the failures of modularity are pragmatic in nature: modules are packaged behind a user
interface and source code is not provided. For source code, the triage method reveals that source
code is definitely available in one third of the cases. The seven "maybe" cases include one piece of
source code that is exceptionally expensive, two of unresolved status, and "cliqueware", available to
selected collaborators only.

For modularity, we find that multicomponent systems are often decouplable with difficulty or not at
all, a shocking fact a priori. The fact is that there is ongoing experimentation in the types of control
structures that relate syntactic and semantic processing, and in semantic representations. Semantic
representations in turn vary with the nature of back-end tasks: only within the subarea of database
interfaces does a representation (SQL) emerge as the equivalent of p-code in programming language
compilers. Modularity is thus an issue from both a research and engineering standpoint.

The  "no"  category  also  includes  several  applications  programs  which  are  simply  intended  to  be  run  in 
isolation.  One  solution,  as  for  PC  Kimmo,  is  to  provide  both  a  version  compiled  with  interpretive 
interface  and  one  compiled  as  a  library  function.  In  the  end,  it  is  possible  to  cut  and  paste  using  some 
software  not  described  as  modular,  by  special  arrangement  with  the  software  developer  or  because, 
in  the  case  of  Parlance,  a  runtime  interface  to  the  semantic  interpreter  is  available.  (Table  2.) 
Table 2. Modularity

                 yes   maybe/some   no

Source Code       12        7       14
Modularity        10        6       14
Cut and Paste     12        8       10


3.2.  Extensibility 

Researchers  may  well  want  to  add  functionality  to  programs  they  acquire,  and  in  general  this 
requires  source  code.  The  exception  is  a  signal  processing  system  to  which  the  user  can  add  macros. 
The  more  extensive  systems  are  so  baroque  as  to  inhibit  extension  even  by  their  own  developers.  The 
number  of  "maybe"  answers  here  will  be  reduced  by  better  phrasing  on  future  questionnaires.  (Table 
3). 

3.3. Range of Languages

Researchers  may  wish  to  experiment  with  other  languages,  or  other  target  domains.  This  is  not 
possible  in  packaged,  commercial  systems,  nor  in  systems  such  as  STEMMA  whose  algorithms 
incorporate  information  specific  to  a  language.  In  general,  systems  are  designed  with  clear 
partitioning  of  code  and  data  (PROLOG  programs  are  considered  case  by  case.)  (Table  4.)  However, 
there  are  further  considerations  that  limit  the  retargetability  of  software  to  additional  languages  and 
domains. Technical issues include choice of character sets and other orthographic assumptions: in
particular, the vast majority of the registered programs are limited to 7-bit ASCII and therefore the
Roman alphabet. Theoretical considerations include orientation towards specific properties of
languages, such as agglutination, simple inflectional morphology, cross-serial dependencies, and so
on. Furthermore, the logical form into which some multicomponent systems translate sentences may
prove to be optimised for the natural language first targeted by system developers.


Table 3. Extensibility by Programmer

                         yes   maybe/some   no

Source Code               12        7       14
A Priori Extensibility    14       15        5
Extensibility              0        8       18

3.4.  Test  Sets 

The  survey  collected  an  order-of-magnitude  report  of  the  size  of  test  sets  to  which  the  software  has 
been  applied.  Unfortunately  answers  did  not  always  distinguish  size  of  auxiliary  data  (lexicons,  rule 
sets)  from  size  of  input  data  sets;  these  are  noted  with  a  small  ’a’.  We  expect  that  the  ratio  of 
auxiliary  and  test  data  would  be  a  very  good  measure  of  system  robustness,  were  it  available. 
Likewise  unavailable  is  detailed  information  on  how  the  test  sets  were  chosen  or  the  actual  quality  of 
the  results.  (Table  5,  6.) 

Because  we  are  evaluating  the  field  as  a  whole  rather  than  individual  NLP  systems,  we  represent  the 
systems  with  a  capital  letter  rather  than  naming  them.  A  few  systems  have  been  tested  on  several 
languages, as noted. For some systems additional information was available or could be inferred, as
noted with an 'i'. For instance, although system 'O' was reported as performing message
understanding,  it  likely  does  not  handle  any  extrasentential  phenomena  and  therefore  would  be 
better  reported  in  terms  of  sentence  semantics. 

4. A Case Study: PC-KIMMO

To date, the Registry has completed one extensive software review [Olsen 1991], that of the Summer
Institute of Linguistics' implementation [Antworth 1990] of finite-state morphological analysis
[Koskenniemi 1983, 1984]. The review's conclusions, based on an attempt to apply the program to a
large-scale text database, are compared here with the description submitted to the Registry and with
two other implementations [Karttunen 1990, Genikomsidis 1988] of the same theory.

The  self-description  in  the  Registry  correctly  reported  good  separation  of  code  and  data,  and  that 
major  modules  are  callable  (compiled  as  library  functions,  in  addition  to  the  interactive  interface.) 
Extensibility was reported as possible by modifying C code. In practice, while the modules are well
chosen from the point of view of a descriptive linguist, it was difficult to modify the code to produce
closely related modules (recognition without a lexicon, generation using the lexicon, both provided by
[Genikomsidis 1988].) It was possible to make other modifications with relative ease.


Table 4. Retargetability to other Languages and Domains

                   yes   maybe   no

Retargetability     21     5      8
The reported range of languages handled by the system is "any". In practice, this involves
inelegancies  such  as  duplication  of  some  portions  of  the  lexicon.  Furthermore,  there  are  language 
groups  such  as  Algonkian  and  Eskimo  which  are  better  modelled  with  a  full  context-free  mechanism. 
Similarly,  there  are  arguably  rare  occurrences  in  English  that  would  require  a  context-free 
mechanism.  The  self-report  thus  makes  assumptions  which  even  its  authors  would  be  quick  to 
qualify,  but  which  would  not  be  obvious  to  an  outsider. 

The  reported  test  sets  were  nine  languages  at  10-100  lexicon  items.  The  reviewers  attempted  to  load 
a  French  lexicon  of  8000  words,  and  encountered  a  bug  (rather  than  a  design  limitation)  which  was 
promptly fixed by SIL. Coverage of French was then limited more by the lexicon-use requirement
than by the provided rule set, which thus remained incompletely tested.

The system had one large disadvantage that was not detectable from the self-report: the interface.
Users were expected to enter production rules as state tables for finite-state machines, a tedious task
with perhaps some minor instructional value to the novice. This is true for several other versions of
the technology in widespread circulation ([Genikomsidis 1988] inter alia), due no doubt to the
presence of this deficiency in their common ancestor. While an alpha version of a production-rule
interpreter has since become available, the KIMMO family of programs stands testimony to the
consequences of setting a bad precedent.

The self-report was thus substantially accurate, even modest, with the exception of theoretical and
interface limits. One would expect this to be typical of academic systems, whereas one would expect
better interfaces and less modest claims in commercial self-reports. One way to improve self-reports
would be to include a small I/O sample. Other information that will be sought in future versions
includes: 

-  number  of  work  years  the  project  represents 

-  clearer separation of auxiliary data sets from test sets (some current questions confuse the two)

-  specifications on individual major modules of multimodule systems

-  character  set  used  for  text  representation 

5.  Summary 

Although the gross descriptions provided by software developers are limited in accuracy and detail,
they have shown us the state of the practice with regard to several important parameters. The survey
has pointed out the failure of the community to deal with orthography in any general way; its success in
providing some tools, and occasionally broad coverage; and the need for further research on modular
semantic processing. The large-grain metrics and parameters used are perhaps the only ones
justified by our method of collecting developer-supplied information. Through the life of the
Registry, we hope to see refinement of the metrics and of the software to which they are applied.


